Neural Networks and Statistical Learning 


This textbook introduces neural networks and machine learning in a statisti- 
cal framework. The contents cover almost all the major popular neural network 
models and statistical learning approaches, including the multilayer perceptron, 
the Hopfield network, the radial basis function network, clustering models and 
algorithms, associative memory models, recurrent networks, principal compo- 
nent analysis, independent component analysis, nonnegative matrix factoriza- 
tion, discriminant analysis, probabilistic and Bayesian models, support vector 
machines, kernel methods, fuzzy logic, neurofuzzy models, hardware implementations,
and some machine learning topics. Finally, applications of these approaches to
biometrics/bioinformatics and data mining are given. This book is the first
of its kind that gives a very comprehensive, yet in-depth introduction to neural 
networks and statistical learning. 

This book is helpful for all academic and technical staff in the fields of neu- 
ral networks, pattern recognition, signal processing, machine learning, computa- 
tional intelligence, and data mining. Many examples and exercises are given to 
help the readers to understand the material covered in the book. 
Preface 


The human brain, consisting of nearly $10^{11}$ neurons, is the center of human
intelligence. Human intelligence has been simulated in various ways. Artificial 
intelligence (AI) pursues exact logical reasoning based on symbol manipulation. 
Fuzzy logic models the highly uncertain behavior of decision making. Neural
networks model the highly nonlinear infrastructure of brain networks. Evolu- 
tionary computation models the evolution of intelligence. Chaos theory models 
the highly nonlinear and chaotic behaviors of human intelligence. 

Softcomputing is an evolving collection of methodologies for the representa- 
tion of the ambiguity in human thinking; it exploits the tolerance for impreci- 
sion and uncertainty, approximate reasoning, and partial truth in order to achieve 
tractability, robustness, and low-cost solutions. The major methodologies of soft- 
computing are fuzzy logic, neural networks, and evolutionary computation. 

Conventional model-based data-processing methods require experts’ knowl- 
edge for the modeling of a system. Neural network methods provide a model-free, 
adaptive, fault tolerant, parallel and distributed processing solution. A neural 
network is a black box that directly learns the internal relations of an unknown 
system, without guessing functions for describing cause-and-effect relationships. 
The neural network approach is a basic methodology of information processing. 
Neural network models may be used for function approximation, classification, 
nonlinear mapping, associative memory, vector quantization, optimization, fea- 
ture extraction, clustering, and approximate inference. Neural networks have 
wide applications in almost all areas of science and engineering. 

Fuzzy logic provides a means for treating uncertainty and computing with 
words. This mimics human recognition, which skillfully copes with uncertainty. 
Fuzzy systems are conventionally created from explicit knowledge expressed in 
the form of fuzzy rules, which are designed based on experts’ experience. A 
fuzzy system can explain its action by fuzzy rules. Neurofuzzy systems, as a 
synergy of fuzzy logic and neural networks, possess both learning and knowledge- 
representation capabilities. 

This book is our attempt to bring together the major advances in neural net- 
works and machine learning, and to explain them in a statistical framework. 
While some mathematical details are needed, we emphasize the practical aspects
of the models and methods rather than the theoretical details. To us, neural 
networks are merely some statistical methods that can be represented by graphs 
and networks. They can iteratively adjust the network parameters. As a statistical
model, a neural network can learn the probability density function from
the given samples, and then predict, by generalization according to the learnt 
statistics, outputs for new samples that are not included in the learning sample 
set. 

The neural network approach is a general statistical computational paradigm. 
Neural network research solves two problems: the direct problem and the inverse 
problem. The direct problem employs computer and engineering techniques to 
model biological neural systems of the human brain. This problem is investigated 
by cognitive scientists and can be useful in neuropsychiatry and neurophysiology. 
The inverse problem simulates biological neural systems for their problem-solving 
capabilities for application in scientific or engineering fields. Engineering and 
computer scientists have conducted extensive investigation in this area. This 
book concentrates mainly on the inverse problem, although the two areas often 
shed light on each other. The biological and psychological plausibility of the
neural network models has not been seriously treated in this book, though
some background material is discussed.

This book is intended to be used as a textbook for advanced undergraduates 
and graduate students in engineering, science, computer science, business, arts, 
and medicine. It is also a good reference book for scientists, researchers, and 
practitioners in a wide variety of fields, and assumes no previous knowledge of 
neural network or machine learning concepts. 

This book is divided into twenty-five chapters and two appendices. It contains 
almost all the major neural network models and statistical learning approaches. 
We also give an introduction to fuzzy sets and logic, and neurofuzzy models. 
Hardware implementations of the models are discussed. Two chapters are ded- 
icated to the applications of neural network and statistical learning approaches 
to biometrics/bioinformatics and data mining. Finally, in the appendices, some 
mathematical preliminaries are given, and benchmarks for validating all kinds of 
neural network methods and some web resources are provided. 
London, especially Anthony Doyle and Grace Quinn for their enthusiastic and 
professional support throughout the period of manuscript preparation. 
Introduction 


Major events in neural networks research 


The discipline of neural networks models the human brain. The average human 
brain consists of nearly $10^{11}$ neurons of various types, with each neuron connecting
to up to tens of thousands of synapses. As such, neural network models are
also called connectionist models. Information processing is mainly in the cere- 
bral cortex, the outer layer of the brain. Cognitive functions, including language, 
abstract reasoning, and learning and memory, represent the most complex brain 
operations to define in terms of neural mechanisms. 

In the 1940s, McCulloch and Pitts [27] found that a neuron can be modeled as 
a simple threshold device to perform logic functions. In 1949, Hebb [14] proposed
the Hebbian rule to describe how learning affects the synapses between two
neurons. In 1952, based upon the physical properties of cell membranes and the 
ion currents passing through transmembrane proteins, Hodgkin and Huxley [15] 
incorporated the neural phenomena such as neuronal firing and action potential 
propagation into a set of evolution equations, yielding quantitatively accurate 
spikes and thresholds. This work brought Hodgkin and Huxley a Nobel Prize in 
1963. In the late 1950s and early 1960s, Rosenblatt [32] proposed the perceptron 
model, and Widrow and Hoff [39] proposed the adaline (adaptive linear element) 
model, trained with a least mean squares (LMS) method. 

In 1969, Minsky and Papert [28] proved mathematically that the perceptron
cannot be used for complex logic functions. This substantially dampened interest
in the field of neural networks. During the same period, the adaline model as
well as its multilayer version called the madaline were successfully used in many
problems; however, they cannot solve linearly inseparable problems due to the
use of a linear activation function.

In the 1970s, Grossberg [12, 13], von der Malsburg [38], and Fukushima [9]
conducted pioneering work on competitive learning and self-organization, inspired
by the connection patterns found in the visual cortex. Fukushima proposed his
cognitron [9] and neocognitron models [10], [11], under the competitive learning 
paradigm. The neocognitron, inspired by the primary visual cortex, is a hierar- 
chical multi-layered neural network specially designed for robust visual pattern 
recognition. Several linear associative memory models were also proposed in that 
period [22]. In 1982, Kohonen proposed the self-organizing map (SOM) [23].
The SOM adaptively transforms incoming signal patterns of arbitrary dimensions
into one- or two-dimensional discrete maps in a topologically ordered fashion. 
Grossberg and Carpenter [13, 4] proposed the adaptive resonance theory (ART) 
model in the mid-1980s. The ART model, also based on competitive learning, is 
recurrent and self-organizing. 

The Hopfield model introduced in 1982 [17] ushered in the modern era of neural 
network research. The model works at the system level rather than at a single 
neuron level. It is a recurrent neural network working with the Hebbian rule. 
This network can be used as an associative memory for information storage and 
for solving optimization problems. The Boltzmann machine [1] was introduced 
in 1985 as an extension to the Hopfield network by incorporating stochastic 
neurons. Boltzmann learning is based on a method called simulated annealing 
[20]. In 1987, Kosko proposed the adaptive bidirectional associative memory 
(BAM) [24]. The Hamming network proposed by Lippmann in the mid-1980s [25]
is based on competitive learning, and is the most straightforward associative 
memory. In 1988, Chua and Yang [5] extended the Hopfield model by proposing 
the cellular neural network model. The cellular network is a dynamical network 
model and is particularly suitable for two-dimensional signal processing and VLSI 
implementation. 

The most prominent landmark in neural network research is the backpropa- 
gation (BP) learning algorithm proposed for the multilayer perceptron (MLP) 
model in 1986 by Rumelhart, Hinton, and Williams [34]. Later on, the BP algo- 
rithm was discovered to have already been invented in 1974 by Werbos [40]. In 
1988, Broomhead and Lowe proposed the radial basis function (RBF) network 
model [3]. Both the MLP and the RBF network are universal approximators. 

In 1982, Oja proposed the principal component analysis (PCA) network for 
classical statistical analysis [30]. In 1994, Comon proposed independent component
analysis (ICA) [6]. ICA is a generalization of PCA, and it is usually used
for feature extraction and blind source separation (BSS). Since then, many neu- 
ral network algorithms for classical statistical methods, such as Fisher’s linear 
discriminant analysis (LDA), canonical correlation analysis (CCA), and factor 
analysis, have been proposed. 

In 1985, Pearl introduced the Bayesian network model [31]. The Bayesian 
network is the best known graphical model in AI. It possesses the characteristic of 
being both a statistical and a knowledge-representation formalism. It establishes 
the foundation for inference of modern AI. 

Another landmark in the machine learning and neural network communities is 
the support vector machine (SVM) proposed by Vapnik et al. in the early 1990s 
[37]. The SVM is based on the statistical learning theory and is particularly 
useful for classification with small sample sizes. The SVM has been used for 
classification, regression and clustering. Thanks to its successful application in 
the SVM, the kernel method has aroused wide interest. 

In addition to neural networks, fuzzy logic and evolutionary computation are 
two other major softcomputing paradigms. Softcomputing is a computing framework
that can tolerate imprecision and uncertainty instead of depending on
exact mathematical computations. Fuzzy logic [41] can incorporate the human 
knowledge into a system by means of fuzzy rules. Evolutionary computation 
[16, 35] originates from Darwin’s theory of natural selection, and can optimize 
in a domain that is difficult to solve by other means. These techniques are now 
widely used to enhance the interpretability of the neural networks or to select 
optimum architecture and parameters of neural networks. 

In summary, the brain is a dynamic information processing system that evolves 
its structure and functionality in time through information processing at differ- 
ent hierarchical levels: quantum, molecular (genetic), single neuron, ensemble of 
neurons, cognitive, and evolutionary [19]: 


• At a quantum level, the particles that constitute every molecule move continuously,
being in several states at the same time that are characterized by
probability, phase, frequency, and energy. These states can change following
the principles of quantum mechanics.

• At a molecular level, RNA and protein molecules evolve in a cell and interact
in a continuous way, based on the stored information in the DNA and on
external factors, and affect the functioning of a cell (neuron).

• At the level of a single neuron, the internal information processes and the
external stimuli change the synapses and cause the neuron to produce a signal
to be transferred to other neurons.

• At the level of neuronal ensembles, all neurons operate together as a function
of the ensemble through continuous learning.

• At the level of the whole brain, cognitive processes take place in a life-long
incremental multiple task/multiple modalities learning mode, such as language
and reasoning, and global information processes are manifested, such
as consciousness.

• At the level of a population of individuals, species evolve through evolution
via changing the genetic DNA code.


Building computational models that integrate principles from different informa- 
tion levels may be efficient for solving complex problems. These models are called 
integrative connectionist learning systems [19]. Information processes at different 
levels in the information hierarchy interact and influence each other. 


Neurons 


Among the $10^{11}$ neurons in the human brain, about $10^{10}$ are in the cortex.
The cortex is the outer mantle of cells surrounding the central structures, e.g.,
brainstem and thalamus. Cortical thickness varies mostly between 2 and 3 mm in
the human, and the cortex is folded, with an average surface area of about 2200 cm$^2$ [42].

Figure 1.1 Schematic drawing of a prototypical neuron.

The neuron, or nerve cell, is the fundamental anatomical and functional unit
of the nervous system including the brain. A neuron is an extension of the simple
cell with two types of appendages: multiple dendrites and an axon. A neuron
possesses all the internal features of a regular cell. A neuron has four components:
the dendrites, the soma (cell body), the axon and the synapse. A soma contains 
a cell nucleus. Dendrites branch into a bushy network around the cell to receive 
input from other neurons, whereas the axon stretches out for a long distance, 
typically a centimeter and as far as a meter in extreme cases. The axon is an 
output channel to other neurons; it branches into strands and substrands to con- 
nect to the dendrites and cell bodies of other neurons. The connecting junction 
is called a synapse. Each cortical neuron receives $10^4$ to $10^5$ synaptic connections,
with most inputs coming from distant neurons. Thus connections in the cortex 
are said to exhibit long-range excitation and short-range inhibition. 

A neuron receives signals from other neurons through its soma and dendrites,
integrates them, and sends output signals to other neurons through its axon. The
dendrites receive signals from several neighboring neurons and pass them on to
the cell body, where they are processed, and the resulting signal is transferred
through the axon. A schematic diagram is shown in Fig. 1.1.

Like any other cell, neurons have a membrane potential, that is, an electric
potential difference between the intracellular and extracellular compartments,
caused by the different densities of sodium (Na) and potassium (K). The neuronal
membrane is endowed with relatively selective ionic channels that allow some
specific ions to cross the membrane. The cell membrane has an electrical resting
potential of −70 mV, which is maintained by pumping positive ions (Na+) out
of the cell. Unlike an ordinary cell, the neuron is excitable. Because of inputs
from the dendrites, the cell may not be able to maintain the −70 mV resting
potential, resulting in an action potential that is a pulse transmitted down the
axon. Signals are propagated from neuron to neuron by a complicated electrochemical
reaction. Chemical transmitter substances pass the synapses and enter
the dendrite, changing the electrical potential of the cell body. When the potential
is above a threshold, an electrical pulse or action potential is sent along the
axon. After releasing the pulse, the neuron returns to its resting potential. The
action potential causes a release of certain biochemical agents for transmitting
messages to the dendrites of nearby neurons. These biochemical transmitters
may have either an excitatory or inhibitory effect on neighboring neurons.
A synapse that increases the potential is excitatory, whereas a synapse that
decreases it is inhibitory.

Figure 1.2 The mathematical model of McCulloch-Pitts neuron.

Synaptic connections exhibit plasticity—long-term changes in the strength of 
connections in response to the pattern of stimulation. Neurons also form new 
connections with other neurons, and sometimes entire collections of neurons can 
migrate from one place to another. These mechanisms are thought to form the 
basis for learning in the brain. Synaptic plasticity is a basic biological mechanism 
underlying learning and memory. Inspired by this, a large number of learning 
rules, specifying how activity and training experience change synaptic efficacies 
[14], have been advanced. 


The McCulloch-Pitts neuron model 


A neuron is a basic processing unit in a neural network. It is a node that processes
all fan-in from other nodes and generates an output according to a transfer function
called the activation function. The activation function represents a linear
or nonlinear mapping from the input to the output and is denoted by $\phi(\cdot)$. The
variable synapses are modelled by weights. The McCulloch-Pitts neuron model
[27], which employs the sigmoidal activation function, was inspired biologically.

Figure 1.2 illustrates the simple McCulloch-Pitts neuron model. The output 
of the neuron is given by 


$$\mathrm{net} = \sum_{i=1}^{J_1} w_i x_i - \theta = \mathbf{w}^T \mathbf{x} - \theta, \tag{1.1}$$

$$y = \phi(\mathrm{net}), \tag{1.2}$$

where $x_i$ is the $i$th input, $w_i$ is the link weight from the $i$th input,
$\mathbf{w} = (w_1, \ldots, w_{J_1})^T$, $\mathbf{x} = (x_1, \ldots, x_{J_1})^T$, $\theta$ is a threshold or bias,
and $J_1$ is the number of inputs. The activation function $\phi(\cdot)$ is usually some
continuous or discontinuous function, mapping the real numbers into the interval
$(-1, 1)$ or $(0, 1)$.

Figure 1.3 VLSI model of a neuron.
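As an illustration of the weighted sum minus threshold followed by an activation function, here is a minimal Python sketch. The function names, the example weights, and the choice of a hard limiter and a logistic function as activations are assumptions made for this example, not definitions taken from the text.

```python
import math

def neuron_output(x, w, theta, phi):
    """Return y = phi(w^T x - theta) for a single neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    net = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return phi(net)

def hard_limiter(net):
    """Discontinuous threshold activation: 1 if net >= 0, else 0."""
    return 1 if net >= 0 else 0

def logistic(net):
    """Continuous sigmoidal activation mapping the reals into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

# Hypothetical weights and threshold that make a two-input neuron
# behave as a logical AND gate under the hard limiter.
w_and, theta_and = [1.0, 1.0], 1.5
and_table = [neuron_output([a, b], w_and, theta_and, hard_limiter)
             for a in (0, 1) for b in (0, 1)]
```

With these illustrative values the neuron fires only when both inputs are 1, which matches the early observation that a threshold device can perform a logic function.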

Neural networks are suitable for VLSI circuit implementations. The analog 
approach is extremely attractive in terms of size, power, and speed. A neuron 
can be realized with a simple amplifier and the synapse is realized with a resistor. 
A memristor is a two-terminal passive circuit element that acts as a variable resistor
whose value can be varied by varying the current passing through it [2]. The
circuit of a neuron is given in Fig. 1.3. Since weights from the circuits can only 
be positive, an inverter can be applied to the input voltage so as to realize a 
negative synaptic weight. 

By Kirchhoff’s current law, the output voltage of the neuron is derived as 


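$$y = \phi\left( \frac{\sum_{i=1}^{J_1} w_i x_i}{\sum_{i=1}^{J_1} w_i} - \theta \right), \tag{1.3}$$

where $x_i$ is the $i$th input voltage, $w_i$ is the conductance of the $i$th resistor, $\theta$ is the
bias voltage, and $\phi(\cdot)$ is the transfer function of the amplifier. The bias voltage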
of a neuron in a VLSI circuit is caused by device mismatches, and is difficult to 
control. 
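The conductance-weighted average inside the transfer function can be motivated by a short nodal calculation. The following is only a sketch under stated assumptions: the amplifier is taken to be ideal and to draw no input current, every resistor connects its input voltage to one common node, and $v$ (a symbol introduced here, not used in the text) denotes the voltage at that node.

```latex
% Kirchhoff's current law at the common input node (no current into the amplifier):
\sum_{i=1}^{J_1} w_i \,(x_i - v) = 0
\quad\Longrightarrow\quad
v = \frac{\sum_{i=1}^{J_1} w_i x_i}{\sum_{i=1}^{J_1} w_i},
\qquad
y = \phi(v - \theta).
```

Under these assumptions the node voltage is the average of the input voltages weighted by the conductances, and the amplifier applies $\phi(\cdot)$ after the bias voltage $\theta$ is subtracted.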

The McCulloch-Pitts neuron model is known as the classical perceptron model, 
and it is used in most neural network models, including the MLP and the Hopfield 
network. Many other neural networks are also based on the McCulloch-Pitts 
neuron model, but use other activation functions. For example, the adaline [39] 
and the SOM [23] use linear activation functions, and the RBF network adopts 
a radial basis function (RBF). 


Spiking neuron models 


Many of the intrinsic properties seen within the brain were not included in the 
classical perceptron, limiting its functionality and use to linear discrimination
tasks. A single classical perceptron is not capable of solving nonlinear problems, 
such as the XOR problem. Spiking neuron and spiking neural network models 
mimic the spiking activity of neurons in the brain when processing information. 
Spiking neurons tend to gather in functional groups firing together during strict 
time intervals, forming coalitions or assemblies [18], also referred to as events. 
Spiking neural networks represent information as trains of spikes. This results in 
a much higher number of patterns stored in a model and more flexible processing. 

During training of a spiking neural network, the weight of the synapse is 
modified according to the timing difference between the pre-synaptic spike and 
the post-synaptic spike. This synaptic plasticity is called spike-timing-dependent 
plasticity [26]. In a biological system, a neuron integrates the excitatory post- 
synaptic current, which is produced by presynaptic stimulus, to change the volt- 
age of its soma. If the soma voltage is larger than a defined threshold, an action 
potential (spike) is produced. 
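The dependence of the weight change on the timing difference between pre- and post-synaptic spikes can be sketched in Python as follows. The exponential window, the amplitudes `a_plus` and `a_minus`, the time constants, and the all-pairs accumulation are common modelling choices assumed here for illustration; the text does not specify a particular form.

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in ms).

    Assumed rule: potentiation when the pre-synaptic spike precedes the
    post-synaptic spike, depression when it follows, both decaying
    exponentially with the size of the timing difference.
    """
    dt = t_post - t_pre
    if dt > 0:    # pre before post: strengthen the synapse
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:    # post before pre: weaken the synapse
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0

def apply_stdp(w, pre_spikes, post_spikes, w_min=0.0, w_max=1.0):
    """Accumulate the update over all spike pairs and clip w to [w_min, w_max]."""
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            w += stdp_delta_w(t_pre, t_post)
    return min(w_max, max(w_min, w))
```

For example, `apply_stdp(0.5, [10.0], [15.0])` returns a value slightly above 0.5, since the pre-synaptic spike precedes the post-synaptic one by 5 ms.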

The integrate-and-fire neuron [36], FitzHugh-Nagumo neuron [8], and 
Hodgkin-Huxley neuron model all incorporate more of the dynamics of actual 
biological neurons. Whereas the Hodgkin-Huxley model describes the biophysical
mechanics of neurons, both the integrate-and-fire and FitzHugh-Nagumo
neurons model key features of biological neurons such as the membrane poten- 
tial, excitatory postsynaptic potential, and inhibitory postsynaptic potential. A 
single neuron incorporating these key features has a higher dimension to the 
information it processes in terms of its membrane threshold, firing rate and 
postsynaptic potential, than a classical perceptron. The integrate-and-fire neu- 
ron model, whose output is binary on a short time scale, either fires an action 
potential or does not. A spike train $s \in \mathcal{S}(\mathcal{T})$ is a sequence of ordered spike times
$s = \{t_m \in \mathcal{T} : m = 1, \ldots, N\}$ corresponding to the time instants in the interval
$\mathcal{T} = [0, T]$ at which a neuron fires. The FitzHugh-Nagumo model is a simplified
version of the Hodgkin-Huxley model which models in a detailed manner the 
activation and deactivation dynamics of a spiking neuron. 
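To make the notion of a spike train concrete, the sketch below simulates a leaky integrate-and-fire neuron and returns the ordered spike times in the interval [0, T]. The leaky variant, the forward Euler integration, and all parameter values (membrane time constant, resistance, threshold, and reset voltage) are assumptions chosen for illustration rather than values from the text.

```python
def lif_spike_train(current, t_end=100.0, dt=0.1, tau_m=10.0, r_m=10.0,
                    v_rest=-70.0, v_thresh=-55.0, v_reset=-70.0):
    """Simulate a leaky integrate-and-fire neuron with forward Euler.

    current: function t -> injected current (nA); with r_m in megaohms,
             r_m * current(t) is in mV.
    Returns the ordered list of spike times (ms) within [0, t_end].
    """
    v = v_rest
    spikes = []
    for k in range(int(round(t_end / dt))):
        t = k * dt
        # Leaky integration: tau_m * dv/dt = -(v - v_rest) + r_m * I(t)
        v += dt * (-(v - v_rest) + r_m * current(t)) / tau_m
        if v >= v_thresh:        # threshold crossed: emit a spike
            spikes.append(t + dt)
            v = v_reset          # reset the membrane potential
    return spikes

# A constant 2 nA input drives the neuron above threshold repeatedly;
# a constant 1 nA input settles below threshold and produces no spikes.
strong = lif_spike_train(lambda t: 2.0)
weak = lif_spike_train(lambda t: 1.0)
```

The output is binary on a short time scale: at each step the neuron either fires or does not, and the list `strong` is exactly a spike train in the sense defined above.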

The Hodgkin-Huxley model [15, 21] incorporates the principal neurobiological 
properties of a neuron in order to understand phenomena such as the action 
potential. It was obtained by casting empirical investigations of the physiological
properties of the squid axon into a dynamical system framework. The model is a
set of conductance-based coupled ordinary differential equations, incorporating 
sodium (Na), potassium (K) and chloride (Cl) ion flows through their respective 
channels. These equations are based upon the physical properties of cell mem- 
branes and the ion currents passing through transmembrane proteins. Chloride 
channel conductances are static (not voltage dependent) and hence leaky. 

According to the Hodgkin-Huxley model, the dynamics of the membrane 
potential V(t) of the neuron can be described by 


$$C \frac{dV}{dt} = -g_{Na} m^3 h (V - V_{Na}) - g_K n^4 (V - V_K) - g_L (V - V_L) + I(t), \tag{1.4}$$

where the first three terms on the right-hand side correspond to the sodium,
potassium, and leakage currents, respectively, and $g_{Na} = 120$ mS/cm$^2$, $g_K = 36$
mS/cm$^2$ and $g_L = 0.3$ mS/cm$^2$ are the maximal conductances of sodium, potassium
and leakage, respectively. The membrane capacitance $C = 1$ $\mu$F/cm$^2$;
$V_{Na} = 50$ mV, $V_K = -77$ mV, and $V_L = -54.4$ mV are the reversal potentials
of sodium, potassium, and leakage currents, respectively. $I(t)$ is the injected current.
The stochastic gating variables $n$, $m$ and $h$ represent the activation term
of the potassium channel, the activation term, and the inactivation term of the
sodium channel, respectively. The factors $n^4$ and $m^3 h$ are the mean portions of
the open potassium and sodium ion channels within the membrane patch. To
take into account the channel noise, $m$, $h$ and $n$ obey the Langevin equations.
When the stimuli S1 and S2 occur at 15 ms and 40 ms of 80 ms, the simulated
results for $V$, $m$, $h$ and $n$ are plotted in Fig. 1.4; this figure was generated by a
Java applet (http://thevirtualheart.org/HHindex.html).

Figure 1.4 Parameters of the Hodgkin-Huxley model for a neuron.
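The membrane equation can be explored numerically with a short Python sketch. Several things here are assumptions and not taken from the text: the gating variables are treated deterministically (channel noise and the Langevin equations are omitted), the voltage-dependent rate functions are the standard textbook forms for a resting potential near −65 mV, the capacitance is taken as 1 µF/cm², integration uses forward Euler, and the constant stimulus of 10 µA/cm² is hypothetical.

```python
import math

# Conductances (mS/cm^2) and reversal potentials (mV) quoted in the text.
G_NA, G_K, G_L = 120.0, 36.0, 0.3
V_NA, V_K, V_L = 50.0, -77.0, -54.4
C_M = 1.0  # membrane capacitance, assumed to be in uF/cm^2

def _ratio(x, y):
    """x / (1 - exp(-x / y)), with the removable singularity at x = 0 handled."""
    if abs(x / y) < 1e-6:
        return y * (1.0 + x / (2.0 * y))
    return x / (1.0 - math.exp(-x / y))

# Assumed standard rate functions (1/ms); they are not given in the text.
def alpha_n(v): return 0.01 * _ratio(v + 55.0, 10.0)
def beta_n(v):  return 0.125 * math.exp(-(v + 65.0) / 80.0)
def alpha_m(v): return 0.1 * _ratio(v + 40.0, 10.0)
def beta_m(v):  return 4.0 * math.exp(-(v + 65.0) / 18.0)
def alpha_h(v): return 0.07 * math.exp(-(v + 65.0) / 20.0)
def beta_h(v):  return 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))

def simulate_hh(current, t_end=50.0, dt=0.01, v0=-65.0):
    """Integrate the deterministic Hodgkin-Huxley equations with forward Euler.

    current: function t (ms) -> injected current (uA/cm^2).
    Returns the list of membrane potentials (mV), one per time step.
    """
    v = v0
    # Start each gating variable at its steady state for the initial voltage.
    n = alpha_n(v) / (alpha_n(v) + beta_n(v))
    m = alpha_m(v) / (alpha_m(v) + beta_m(v))
    h = alpha_h(v) / (alpha_h(v) + beta_h(v))
    trace = [v]
    for k in range(int(round(t_end / dt))):
        t = k * dt
        i_ion = (G_NA * m ** 3 * h * (v - V_NA)
                 + G_K * n ** 4 * (v - V_K)
                 + G_L * (v - V_L))
        dv = (-i_ion + current(t)) / C_M
        dn = alpha_n(v) * (1.0 - n) - beta_n(v) * n
        dm = alpha_m(v) * (1.0 - m) - beta_m(v) * m
        dh = alpha_h(v) * (1.0 - h) - beta_h(v) * h
        v, n, m, h = v + dt * dv, n + dt * dn, m + dt * dm, h + dt * dh
        trace.append(v)
    return trace

stimulated = simulate_hh(lambda t: 10.0)  # hypothetical constant stimulus
resting = simulate_hh(lambda t: 0.0)      # no injected current
```

With the stimulus the membrane potential produces action potentials that overshoot 0 mV, whereas without it the potential stays close to its resting value.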


Neural networks 


A neural network is characterized by the network architecture, node character- 
istics, and learning rules. 


Architecture 
The network architecture is represented by the connection weight matrix
$W = [w_{ij}]$, where $w_{ij}$ denotes the connection weight from node $i$ to node $j$. When
$w_{ij} = 0$, there is no connection from node $i$ to node $j$. By setting some $w_{ij}$'s to
zero, different network topologies can be realized. Neural networks can be grossly 
classified into feedforward neural networks, recurrent neural networks, and their 
hybrids. 
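The idea that a topology is obtained by zeroing entries of the weight matrix can be sketched in Python. The helper names, the layer-by-layer node numbering, and the use of a cycle check to distinguish feedforward from recurrent structure are assumptions made for this illustration.

```python
def layered_feedforward_weights(layer_sizes, weight=1.0):
    """Build W = [w_ij] for a fully connected layered feedforward network.

    Nodes are numbered layer by layer. w[i][j] is nonzero only when node j
    lies in the layer immediately after the layer of node i.
    """
    n = sum(layer_sizes)
    w = [[0.0] * n for _ in range(n)]
    start = 0
    for size, next_size in zip(layer_sizes, layer_sizes[1:]):
        nxt = start + size
        for i in range(start, nxt):
            for j in range(nxt, nxt + next_size):
                w[i][j] = weight
        start = nxt
    return w

def has_feedback(w):
    """Return True if the directed graph of nonzero weights contains a cycle."""
    n = len(w)
    state = [0] * n  # 0 = unvisited, 1 = on the current path, 2 = finished

    def visit(i):
        state[i] = 1
        for j in range(n):
            if w[i][j] != 0.0:
                if state[j] == 1 or (state[j] == 0 and visit(j)):
                    return True
        state[i] = 2
        return False

    return any(state[i] == 0 and visit(i) for i in range(n))

w_ff = layered_feedforward_weights([2, 3, 1])
ff_is_recurrent = has_feedback(w_ff)   # no feedback in a layered network
w_ff[5][0] = 1.0                       # add one connection from output back to input
fb_is_recurrent = has_feedback(w_ff)   # a single feedback link makes it recurrent
```

A single nonzero entry pointing backwards is enough to turn the feedforward matrix into a recurrent one, which is why the same matrix representation covers both classes.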

Popular network topologies are fully connected layered feedforward networks, 
recurrent networks, lattice networks, layered feedforward networks with lateral 
connections, and cellular networks, as shown in Fig. 1.5. 


• In a feedforward network, the connections between neurons are in one direction.
A feedforward network is usually arranged in the form of layers. In such
a layered feedforward network, there is no connection between the neurons in
the same layer, and there is no feedback between layers. In a fully connected
layered feedforward network, every node in any layer is connected to every
node in its adjacent forward layer. The MLP and the RBF network are fully
connected layered feedforward networks.

• In a recurrent network, there exists at least one feedback connection. The
Hopfield model and the Boltzmann machine are two examples of recurrent
networks.

• A lattice network consists of a one-, two-, or higher-dimensional array of neurons.
Each array has a corresponding set of input nodes. The Kohonen network [23]
uses a one- or two-dimensional lattice architecture.

• A layered feedforward network with lateral connections has lateral connections
between the units at the same layer of its layered feedforward network
architecture. A competitive learning network has a two-layered network of
such an architecture. The feedforward connections are excitatory, while the
lateral connections in the same layer are inhibitory. Some PCA networks using
the Hebbian/anti-Hebbian learning rules [33] also employ this kind of network
topology.

• A cellular network consists of regularly spaced neurons, called cells, which
communicate only with the neurons in their immediate neighborhood. Adjacent
cells are connected by mutual interconnections. Each cell is excited by its own
signals and by signals flowing from its adjacent cells [5].

Figure 1.5 Architecture of neural networks. (a) Layered feedforward network. (b) Recurrent network.
(c) Two-dimensional lattice network. (d) Layered feedforward network with lateral connections. (e)
Cellular network. The big numbered circles stand for neurons and the small ones for input nodes.


In this book, we use the notation $J_1$-$J_2$-$\cdots$-$J_M$ to represent a neural network
with a layered architecture of $M$ layers, where $J_i$ is the number of nodes in the
ith layer. Notice that the input layer is counted as layer 1 and nodes at this layer 
are not neurons. Layer M is the output layer. 
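As a small illustration of this notation, the Python sketch below parses an architecture string such as `4-5-3` and counts neurons and connection weights. The function names are hypothetical, and the weight count assumes full connectivity between adjacent layers with bias terms excluded.

```python
def parse_architecture(spec):
    """Turn a string such as '4-5-3' into the list of layer sizes [4, 5, 3]."""
    sizes = [int(s) for s in spec.split("-")]
    if len(sizes) < 2 or any(s <= 0 for s in sizes):
        raise ValueError("need at least two layers with positive sizes")
    return sizes

def count_neurons(sizes):
    """Layer 1 holds input nodes only, so neurons are counted from layer 2 on."""
    return sum(sizes[1:])

def count_weights(sizes):
    """Number of connections, assuming adjacent layers are fully connected."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

sizes = parse_architecture("4-5-3")
n_neurons = count_neurons(sizes)   # 5 + 3
n_weights = count_weights(sizes)   # 4*5 + 5*3
```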
Operation 

The operation of neural networks is divided into two stages: learning (train- 
ing) and generalization (recalling). Network training is typically accomplished 
by using examples, and network parameters are adapted using a learning algo- 
rithm, in an online or offline manner. Once the network is trained to accomplish 
the desired performance, the learning process is terminated and it can then be 
used directly to replace the complex system dynamics. The trained network can 
be used to operate in a static manner: to emulate unknown dynamics or a
nonlinear relationship.

For real-time applications, a neural network is required to have a constant 
processing delay regardless of the number of input nodes and to have a minimum 
number of layers. As the number of input nodes increases, the size of the network 
layers should grow at the same rate without additional layers. 

Adaptive neural networks are a class of neural networks that do not need to 
be trained by providing a training pattern set. They can learn when they are 
performing. For adaptive neural networks, unsupervised learning methods are 
usually used. For example, the Hopfield model uses a generalized Hebbian learn- 
ing rule for implementation as associative memory. Any time a pattern is pre- 
sented to it, the Hopfield network always updates the connection weights. After 
the network is trained with standard patterns and is prepared for generalization, 
the learning capability should be disabled; otherwise, when an incomplete or
noisy pattern is presented to the network, it will search for the closest match
while the memorized pattern is overwritten by this new pattern.
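The separation between storing patterns and recalling them with learning disabled can be sketched in Python. This example uses the plain Hebbian outer-product rule with bipolar patterns and asynchronous updates, which is a simplifying assumption and not necessarily the generalized Hebbian rule mentioned above; the patterns and function names are illustrative.

```python
def train_hopfield(patterns):
    """Hebbian (outer-product) weights for bipolar patterns, with zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, probe, max_sweeps=10):
    """Asynchronous recall; the weights stay fixed, so learning is disabled."""
    s = list(probe)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = sum(w[i][j] * s[j] for j in range(n))
            new = 1 if net >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:   # a full sweep without changes: stable state reached
            break
    return s

# Two orthogonal bipolar patterns serve as the stored memories.
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hopfield([p1, p2])

noisy = [-1, 1, 1, 1, -1, -1, -1, -1]   # p1 with its first element flipped
restored = recall(w, noisy)
```

Because `recall` never modifies `w`, presenting the noisy probe retrieves the stored pattern instead of overwriting it.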

Reinforcement learning is also naturally adaptive, where the environment is 
treated as a teacher. Supervised learning is not adaptive in nature. 


Properties 

Neural networks are biologically motivated. Each neuron is a computational 
node, which represents a nonlinear function. Neural networks possess the fol- 
lowing advantages [7]: 


• Adaptive learning: They can adapt themselves by changing the network
parameters in a surrounding environment.

• Generalization: A trained neural network has superior generalization capability.

• General-purpose nonlinear nature: They perform like a black box.

• Self-organizing: Some neural networks such as the SOM [23] and competitive
learning based neural networks have a self-organization property.

• Massive parallelism and simple VLSI implementations: Each basic
processing unit usually has a uniform property. This parallel structure allows
for highly parallel software and hardware implementations.

• Robustness and fault tolerance: A neural network can easily handle
imprecise, fuzzy, noisy, and probabilistic information. It is a distributed information
system, where information is stored in the whole network in a distributed
manner by the network structure such as W. Thus, the overall performance
does not degrade significantly when the information at some nodes
is lost or some connections in the network are damaged. The network repairs
itself, and thus possesses a fault-tolerant capability.


Applications 
Neural networks can be treated as a general statistical tool for almost all 
disciplines of science and engineering. The applications can be in modeling and 
system identification, classification, pattern recognition, optimization, control, 
industrial application, communications, signal processing, image analysis, 
bioinformatics, and data mining. Pattern recognition is central to biological and 
artificial intelligence; it is a complete process that gathers the observations, 
extracts features from the observations, and classifies or describes the 
observations. Pattern recognition is one of the most fundamental applications of 
neural networks. 


More specifically, some neural network models have the following functions. 


Function approximation: This capability is generally used for modeling 
and system identification, regression and prediction, control, signal process- 
ing, pattern recognition and classification, and associative memory. Image 
restoration is also a function approximation problem. The MLP and RBF 
networks are universal approximators for nonlinear functions. Some recurrent 
networks are universal approximators of dynamical systems. Prediction is an 
open-loop problem while control is a closed-loop problem. 

Classification: Classification is the most fundamental application of neural 
networks. Classification can be based on the function approximation capability 
of neural networks. 

Clustering and vector quantization: Clustering groups together similar 
objects, based on some distance measure. Unlike in classification problems, 
the class membership of a pattern is not known a priori. Vector quantization 
is similar to clustering. 

Associative memory: An association is an input-output pair. Associative 
memory, also known as content-addressable memory, is a memory organization 
that accesses memory by its content instead of its address. It picks up a 
desirable match from all stored prototypes, when an incomplete or corrupted 
sample is presented. Associative memories are useful for pattern recognition, 
pattern association, or pattern completion. 

Optimization: Some neural network models, such as the Hopfield model 
and the Boltzmann machine, can be used to solve combinatorial optimization 
problems (COPs). 

Feature extraction and information compression: Coding and information 
compression are essential tasks in the transmission and storage of speech, 
audio, image, video, and other information. PCA, ICA, and vector quantization 
can achieve the objective of feature extraction and information compression. 
Chapter 1. Introduction 


Scope of the book 


This book contains twenty-six chapters and two appendices: 


Chapter 2 describes some fundamental topics on neural networks and machine 
learning. 

Chapter 3 is dedicated to the perceptron. 

The MLP is the topic of Chapters 4 and 5. The MLP with BP learning 
is introduced in Chapter 4, and structural optimization of the MLP is also 
described in this chapter. 

The MLP with second-order learning is introduced in Chapter 5. 

Chapter 6 treats the Hopfield model, its application for solving COPs, 
simulated annealing, chaotic neural networks and cellular networks. 

Chapter 7 describes associative memory models and algorithms. 

Chapters 8 and 9 are dedicated to clustering. Chapter 8 introduces Kohonen 
networks, ART networks, C-means, subtractive, and fuzzy clustering. 
Chapter 9 introduces many advanced topics in clustering. 

In Chapter 10, we elaborate on the RBF network model. 

Chapter 11 introduces the learning of general recurrent networks. 

Chapter 12 deals with PCA networks and algorithms. The minor component 
analysis (MCA), crosscorrelation PCA networks, generalized eigenvalue 
decomposition (EVD) and CCA are also introduced in this chapter. 
Nonnegative matrix factorization (NMF) is introduced in Chapter 13. 

ICA and BSS are introduced in Chapter 14. 

Discriminant analysis is described in Chapter 15. 

Probabilistic and Bayesian networks are introduced in Chapter 19. Many topics 
such as the EM algorithms, the HMM, sampling (Monte Carlo) methods, and 
the Boltzmann machine are treated in this framework. 

SVMs are introduced in Chapter 16. 

Kernel methods other than SVMs are introduced in Chapter 17. 
Reinforcement learning is introduced in Chapter 18. 

Ensemble learning is introduced in Chapter 20. 

Fuzzy sets and logic are introduced in Chapter 21. 

Neurofuzzy models are described in Chapter 22. Transformations between 
fuzzy logic and neural networks are also discussed. 

Implementation of neural networks in hardware is treated in Chapter 23. 

In Chapter 24, we give an introduction to neural network applications to 
biometrics and bioinformatics. 

Data mining as well as the application of neural networks to the field is intro- 
duced in Chapter 25. 

Mathematical preliminaries are included in Appendix A. 

Some benchmarks and resources are included in Appendix B. 


Examples and exercises are included in most of the chapters. 
1.1 List the major differences between the neural-network approach and 
classical information-processing approaches. 


1.2 Formulate a McCulloch-Pitts neuron for four variables: white blood count, 
systolic blood pressure, diastolic blood pressure, and pH of the blood. 


1.3 Derive Equation (1.3) from Fig. 1.3. 
Fundamentals of Machine Learning 


Learning methods 


Learning is a fundamental capability of neural networks. Learning rules are algo- 
rithms for finding suitable weights W and/or other network parameters. Learn- 
ing of a neural network can be viewed as a nonlinear optimization problem for 
finding a set of network parameters that minimize the cost function for given 
examples. This kind of parameter estimation is also called a learning or training 
algorithm. 

Neural networks are usually trained by epoch. An epoch is a complete run when 
all the training examples are presented to the network and are processed using 
the learning algorithm only once. After learning, a neural network represents a 
complex relationship, and possesses the ability for generalization. To control a 
learning process, a criterion is defined to decide the time for terminating the 
process. The complexity of an algorithm is usually denoted as O(m), indicating 
that the order of number of floating-point operations is m. 

Learning methods are conventionally divided into supervised, unsupervised, 
and reinforcement learning; these schemes are illustrated in Fig. 2.1. $x_p$ and 
$y_p$ are the input and output of the pth pattern in the training set, $\hat{y}_p$ 
is the neural network output for the pth input, and $E$ is an error function. From 
a statistical viewpoint, unsupervised learning learns the pdf of the training set, 
$p(x)$, while supervised learning learns about the pdf $p(y|x)$. Supervised learning is widely 
used in classification, approximation, control, modeling and identification, signal 
processing, and optimization. Unsupervised learning schemes are mainly used for 
clustering, vector quantization, feature extraction, signal coding, and data anal- 
ysis. Reinforcement learning is usually used in control and artificial intelligence. 


In logic and statistical inference, transduction is reasoning from observed, spe- 
cific (training) cases to specific (test) cases. In contrast, induction is reasoning 
from observed training cases to general rules, which are then applied to the 
test cases. Machine learning falls into two broad classes: inductive learning and 
transductive learning. Inductive learning pursues the standard goal in machine 
learning, which is to accurately classify the entire input space. In contrast, trans- 
ductive learning focuses on a predefined target set of unlabeled data, the goal 
being to label the specific target set. 
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(c) 
Figure 2.1 Learning methods. (a) Supervised learning. e, = ĝ, — Yp. (b) Unsupervised learning. (c) 
Reinforcement learning. 


Multitask learning improves the generalization performance of learners by 
leveraging the domain-specific information contained in the related tasks [30]. 
Multiple related tasks are learned simultaneously using a shared representation. 
In fact, the training signals for extra tasks serve as an inductive bias [30]. 

In order to learn accurate models for rare cases, it is desirable to use data 
and knowledge from similar cases; this is known as transfer learning. Transfer 
learning is a general method for speeding up learning. It exploits the insight 
that generalization may occur not only within tasks, but also across tasks. The 
core idea of transfer is that experience gained in learning to perform one source 
task can help improve learning performance in a related, but different, target 
task [155]. Transfer learning is related in spirit to case-based and analogical 
learning. A theoretical analysis based on an empirical Bayes perspective shows 
that the number of labeled examples required for learning with transfer is often 
significantly smaller than that required for learning each target independently 
[155]. 


Supervised learning 

Supervised learning adjusts network parameters by a direct comparison between 
the actual network output and the desired output. Supervised learning is a closed- 
loop feedback system, where the error is the feedback signal. The error measure, 
which shows the difference between the network output and the output from 
the training samples, is used to guide the learning process. The error measure is 
usually defined by the mean squared error (MSE) 

$$E = \frac{1}{N} \sum_{p=1}^{N} \left\| y_p - \hat{y}_p \right\|^2, \tag{2.1}$$

where $N$ is the number of pattern pairs in the sample set, $y_p$ is the output part of 
the pth pattern pair, and $\hat{y}_p$ is the network output corresponding to the pattern 
pair p. The error $E$ is calculated anew after each epoch. The learning process is 
terminated when $E$ is sufficiently small or a failure criterion is met. 

To decrease E toward zero, a gradient-descent procedure is usually applied. 
The gradient-descent method generally converges only to a local minimum in a 
neighborhood of the initial solution of network parameters. The LMS and BP algorithms 
are the two most popular gradient-descent based algorithms. Second-order methods 
are based on the computation of the Hessian matrix. 
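
As a minimal sketch of gradient descent on the MSE, the following trains a single linear unit in batch mode. The model, the learning rate `eta`, the tolerance `tol`, and the toy data are illustrative assumptions, not taken from the text; the function names are hypothetical.

```python
def mse(targets, outputs):
    """Mean squared error over N scalar pattern pairs."""
    return sum((y - o) ** 2 for y, o in zip(targets, outputs)) / len(targets)


def train_linear_unit(xs, ys, eta=0.1, epochs=500, tol=1e-10):
    """Batch gradient descent on the MSE of a linear unit y_hat = w * x + b."""
    w, b = 0.0, 0.0
    history = []
    n = len(xs)
    for _ in range(epochs):
        outputs = [w * x + b for x in xs]
        error = mse(ys, outputs)  # E is recomputed once per epoch
        history.append(error)
        if error < tol:  # terminate when E is sufficiently small
            break
        # Gradient of E with respect to w and b
        grad_w = sum(2 * (o - y) * x for x, y, o in zip(xs, ys, outputs)) / n
        grad_b = sum(2 * (o - y) for y, o in zip(ys, outputs)) / n
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, history


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]  # generated by y = 2x + 1
w, b, history = train_linear_unit(xs, ys)
```

For this convex toy problem the local minimum is also the global one; for an MLP the same update rule would only be expected to reach a local minimum near the initial weights.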

Multiple-instance learning [46] is a variation of supervised learning. In 
multiple-instance learning, the examples are bags of instances, and the bag label 
is a function of the labels of its instances. Typically, this function is the Boolean 
OR. A unified theoretical analysis for multiple-instance learning and a PAC- 
learning algorithm are introduced in [122]. 
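
The Boolean-OR labeling of bags can be illustrated with a tiny sketch; the helper name `bag_label` and the example bags are hypothetical.

```python
def bag_label(instance_labels):
    """A bag is positive if at least one of its instances is positive (Boolean OR)."""
    return any(instance_labels)


bags = {
    "bag_a": [False, False, True],   # one positive instance -> positive bag
    "bag_b": [False, False, False],  # no positive instance -> negative bag
}
labels = {name: bag_label(instances) for name, instances in bags.items()}
```

In a learning setting only the bag labels are observed, and the instance labels remain hidden from the learner.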

Deductive reasoning starts from a cause to deduce the consequence or effects. 
Inductive reasoning allows us to infer possible causes from the consequence. 
Inductive learning is a special class of the supervised learning techniques, 
where given a set of $\{x_i, f(x_i)\}$ pairs, we determine a hypothesis $h$ such 
that $h(x_i) \approx f(x_i)$, $\forall i$. In inductive learning, given many positive 
and negative instances of a problem, the learner has to form a concept that supports 
most of the positive but no negative instances. Forming a concept in this way 
requires a number of training instances. In contrast, analogical learning 
can be accomplished from a single example; for instance, given the training 
instance that the plural of fungus is fungi, one can determine the plural of 
bacillus: bacillus -> bacilli. 


Unsupervised learning 

Unsupervised learning involves no target values. It tries to autoassociate infor- 
mation from the inputs with an intrinsic reduction of data dimensionality or 
total amount of input data. Unsupervised learning is solely based on the cor- 
relations among the input data, and is used to find the significant patterns or 
features in the input data without the help of a teacher. Unsupervised learning 
is particularly suitable for biological learning in that it does not rely on a teacher 
and it uses intuitive primitives like neural competition and cooperation. 

A criterion is needed to terminate the learning process. Without a stopping 
criterion, a learning process continues even when a pattern, which does not 
belong to the training patterns set, is presented to the network. The network 
is adapted according to a constantly changing environment. Hebbian learning, 
competitive learning, and the SOM are the three well-known unsupervised learn- 
ing approaches. Generally speaking, unsupervised learning is slow to settle into 
stable conditions. 

In Hebbian learning, learning is a purely local phenomenon, involving only 
two neurons and a synapse. The synaptic weight change is proportional to the 
correlation between the pre- and post-synaptic signals. Many neural networks 
for PCA and associative memory are based on Hebbian learning. In competi- 
tive learning, the output neurons of a neural network compete for the right to 
respond. The SOM is also based on competitive learning. Competitive learning is 
directly related to clustering. The Boltzmann machine uses a stochastic training 
technique known as simulated annealing, which can be treated as a special type 
of unsupervised learning based on the inherent property of a physical system. 
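
The link between competitive learning and clustering can be sketched with a winner-take-all update, in which only the neuron closest to the input moves toward it. The learning rate, the initial prototypes, and the toy data below are illustrative assumptions, and the function names are hypothetical.

```python
def winner(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def competitive_learning(data, weights, eta=0.2, epochs=20):
    """Winner-take-all learning: only the winning neuron is updated."""
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in data:
            k = winner(weights, x)
            weights[k] = [wi + eta * (xi - wi) for wi, xi in zip(weights[k], x)]
    return weights


# Two well-separated groups of 2-D points
data = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (1.0, 0.9), (0.9, 1.0), (1.0, 1.0)]
prototypes = competitive_learning(data, weights=[(0.4, 0.4), (0.6, 0.6)])
```

After training, each weight vector acts as a prototype for one group, so `winner` assigns a new point to a cluster.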
Reinforcement learning 

Reinforcement learning is a class of computational algorithms that specifies how 
an artificial agent (e.g., a real or simulated robot) can learn to select actions 
in order to maximize the total expected reward [11]. The difference between the 
received and the predicted reward, termed the reward-prediction error, has been 
shown to correlate very well with the 
phasic activity of dopamine-releasing neurons projecting from the substantia 
nigra in non-human primates [125]. 

Reinforcement learning is a special case of supervised learning, where the exact 
desired output is unknown. The teacher supplies only feedback about success or 
failure of an answer. This is cognitively more plausible than supervised learn- 
ing since a fully specified correct answer might not always be available to the 
learner or even the teacher. It is based only on the information as to whether 
or not the actual output is close to the estimate. Reinforcement learning is a 
learning procedure that rewards the neural network for its good output result 
and punishes it for the bad output result. Explicit computation of derivatives 
is not required. This, however, presents a slower learning process. For a control 
system, if the controller still works properly after an input, the output is judged 
as good; otherwise, it is considered as bad. The evaluation of the binary output, 
called external reinforcement, is used as the error signal. 


Semi-supervised learning and active learning 

In many machine learning applications, such as bioinformatics, web and text 
mining, text categorization, database marketing, spam detection, face recogni- 
tion, and video-indexing, abundant amounts of unlabeled data can be cheaply 
and automatically collected. However, manual labeling is often slow, expensive, 
and error-prone. When only a small number of labeled samples are available, 
unlabeled samples could be used to prevent the performance degradation due to 
overfitting. 

The goal of semi-supervised learning is to employ a large collection of unlabeled 
data jointly with a few labeled examples for improving generalization perfor- 
mance. Some semi-supervised learning methods are based on some assumptions 
that relate the probability P(x) to the conditional distribution P(Y = 1|X = x). 
Semi-supervised learning is related to the problem of transductive learning. Two 
typical semi-supervised learning approaches are learning with the cluster assump- 
tion [148] and learning with the manifold assumption [18]. The cluster assump- 
tion requires that data within the same cluster are more likely to have the same 
label. The most prominent example is the transductive SVM [148]. 

Universum data are given as a set of unlabeled examples that do not belong to 
either class of the classification problem of interest. Contradiction happens when 
two functions in the same equivalence class have different signed outputs on a 
sample from the Universum. Universum learning is conceptually different from 
semi-supervised learning or transduction [148], because the Universum data is 
not from the same distribution as the labeled training data. Universum learning 
implements a trade-off between explaining training samples (using large margin 
hyperplanes) and maximizing the number of contradictions (on the Universum). 

In active learning, or so-called pool-based active learning, the labels of data 
points are initially hidden, and the learner must pay for each label he wishes to 
be revealed. The goal of active learning is to actively select the most informative 
examples for manual labeling in these learning tasks, that is, designing input 
signals for optimal generalization [55]. Based on conditional expectation of the 
generalization error, a pool-based active learning method effectively copes with 
model misspecification by weighting training samples according to their impor- 
tance [136]. Reinforcement learning can be regarded as a form of active learning. 
At this point a query mechanism pro-actively asks for the labels of some of the 
unlabeled data. 

Examples of situations in which active learning can be employed are web 
searching, email filtering, and relevance feedback for a database or website. The 
first two examples involve induction. The goal is to create a classifier that works 
well on unseen future instances. The third situation is an example of transduction 
[148]. The learner’s performance is assessed on the remaining instances in the 
database rather than a totally independent test set. 

The query-by-committee algorithm [56] is an active learning algorithm for 
classification which uses a prior distribution over hypotheses. In this algorithm, 
the learner observes a stream of unlabeled data and makes spot decisions about 
whether or not to ask for each point’s label. If the data is drawn uniformly 
from the surface of the unit sphere in $\mathbb{R}^d$, and the hidden labels correspond 
perfectly to a homogeneous (i.e., through the origin) linear separator from this 
same distribution, then it is possible to achieve generalization error $\epsilon$ after seeing 
$O((d/\epsilon) \log(1/\epsilon))$ points and requesting just $O(d \log(1/\epsilon))$ labels: an exponential 
improvement over the usual $O(d/\epsilon)$ sample complexity of learning linear 
separators in a supervised setting. The query-by-committee algorithm involves random 
sampling from intermediate version spaces; the complexity of the update step 
scales polynomially with the number of updates performed. 

An information-based approach for active data selection is presented in [90]. 
In [135], a two-stage sampling scheme for reducing both the bias and variance is 
given, and based on it, two active learning methods are given. 

In a framework for batch mode active learning [75], a number of informative 
examples are selected for manual labeling in each iteration. The key feature is 
to reduce the redundancy among the selected examples such that each example 
provides unique information for model updating. The set of unlabeled examples 
that can efficiently reduce the Fisher information of the classification model is 
chosen [75]. 
Learning and generalization 


From an approximation viewpoint, learning is a hypersurface reconstruction 
based on existing examples, while generalization means estimating the value on 
the hypersurface where there is no example. Mathematically, the learning pro- 
cess is a nonlinear curve-fitting process, while generalization is the interpolation 
and extrapolation of the input data. 

The goal of training neural networks is not to learn an exact representation 
of the training data itself, but rather to build a statistical model of the process 
which generates the data. The problem of reconstructing the mapping is said to 
be well-posed if an input always generates a unique output, and the mapping 
is continuous. Learning is an ill-posed inverse problem. Given examples of an 
input-output mapping, an approximate solution is required to be found for the 
mapping. The input data may be noisy or imprecise, and also may be insufficient 
to uniquely construct the mapping. The regularization technique can transform 
an ill-posed problem into a well-posed one so as to stabilize the solution by 
adding some auxiliary non-negative functional for constraints [140, 106]. 

When a network is overtrained with too many examples, parameters or epochs, 
it may produce good results for the training data, but has a poor generalization 
capability. This is the overfitting phenomenon, and is illustrated in Fig. 2.2. In 
statistics, overfitting applies to the situation wherein a model possesses too many 
parameters, and fits the noise in the data rather than the underlying function. A 
simple network with smooth input-output mapping usually has a better general- 
ization capability. Generally, the generalization capability of a network is jointly 
determined by the size of the training pattern set, the complexity of the problem, 
and the architecture of the network. 


Example 2.1: To approximate a noisy cosine function, with 20 random samples, 
we employ a 1-30-1 feedforward network. The result is plotted in Fig. 2.2. The 
noisy samples are represented by the “o” symbols, and the true network response 
is given by the solid line. Clearly, the learned network is overfitted, and it does 
not generalize well. Notice that if the number of parameters in the network is 
much smaller than the total number of points in the training set, then there is 
little or no worry of overfitting. 


For a given network topology, we can estimate the minimal size of the training 
set for successfully training the network. For conventional curve-fitting tech- 
niques, the required number of examples usually grows with the dimensionality 
of the input space, namely, the curse of dimensionality. Feature extraction can 
reduce input dimensionality and thus improve the generalization capability of 
the network. 
Figure 2.2 Proper fitting and overfitting. Dashed line corresponds to proper fitting, and solid line 
corresponds to overfitting. 




The training set should be sufficiently large and diverse so that it could 
represent the problem well. For good generalization, the size of the training set, $N$, 
should be at least several times larger than the network’s capacity, i.e. $N \gg N_w / N_y$, 
where $N_w$ is the total number of weights or free parameters, and $N_y$ the number 
of output components [150]. 
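
As a rough check of this rule against the 1-30-1 network and 20 samples of Example 2.1, the free parameters can be counted as follows. The sketch assumes a fully connected network in which biases count as free parameters, and it reads the capacity as the number of free parameters divided by the number of outputs; the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected feedforward network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_w = count_parameters([1, 30, 1])  # 1-30-1 network of Example 2.1
n_y = 1                             # number of output components
capacity = n_w / n_y
n_samples = 20
enough_data = n_samples > capacity  # the rule asks for N well above the capacity
```

Here the network has 91 free parameters for only 20 samples, which is consistent with the overfitting observed in Example 2.1.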


Generalization error 


The generalization error of a trained network can be decomposed into two parts, 
namely, an approximation error that is due to a finite number of parameters of 
the approximation scheme used and an unknown level of noise in the training 
data, and an estimation error that is due to a finite number of data available 
[98]. For a feedforward network with $J_1$ input nodes and a single output node, 
a bound on the generalization error is associated with the order of hypothesis 
parameters $N_P$ and the number of examples $N$ [98] 

$$O\left(\frac{1}{N_P}\right) + O\left(\left[\frac{N_P J_1 \ln(N_P N) - \ln \delta}{N}\right]^{1/2}\right), \quad \text{with probability } p > 1 - \delta, \tag{2.2}$$

where $\delta \in (0,1)$ is the confidence parameter, and $N_P$ is proportional to the 
number of parameters, such as $N_P$ centers in an RBF network, or $N_P$ sigmoidal 
hidden units in an MLP. The first term corresponds to the bound on the 
approximation error, and the second to that on the estimation error. 

As $N_P$ increases, the approximation error decreases since a larger model is 
used; however, the estimation error increases due to overfitting (or alternatively, 
more data are needed). Thus, one cannot reduce the upper bounds on both the error 
components simultaneously. Given the amount of data available, the optimal size of 
the model for the tradeoff between the approximation and estimation errors is 
selected as $N_P \propto N^{1/3}$ [98]. After suitably selecting $N_P$ and $N$, the generalization 
error for feedforward networks should be $O(1/N_P)$. This result is similar to that 
for an MLP with sigmoidal functions [10]. 
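
The optimal model size can be motivated by a heuristic balance of the two error terms. The sketch below assumes that the approximation term scales as $1/N_P$ and the estimation term as $(N_P/N)^{1/2}$, ignoring constants, the input dimension, and logarithmic factors; it is an informal argument, not the derivation of [98].

```latex
\[
\frac{1}{N_P} \sim \left(\frac{N_P}{N}\right)^{1/2}
\;\Longrightarrow\;
N_P^{3} \sim N
\;\Longrightarrow\;
N_P \propto N^{1/3},
\qquad\text{so that the bound is } O\!\left(N^{-1/3}\right) = O\!\left(\frac{1}{N_P}\right).
\]
```

Under this balance, neither term dominates, which is why both components of the bound shrink at the same rate as more data become available.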


Generalization by stopping criterion 


Generalization can be controlled during training. Overtraining can be avoided by 
stopping the training before the absolute minimum is reached. Neural networks 
trained with iterative gradient-based methods tend to learn a mapping in the 
hierarchical order of its increasing components of frequency. When training is 
stopped at an appropriate point, the network will not learn the high-frequency 
noise. While the training error will always decrease, the generalization error will 
decrease to a minimum and then begin to rise again as the network is being 
overtrained. Training should stop at the optimum stopping point. The general- 
ization error is defined in the same form as the learning error, but on a separate 
validation set of data. Early stopping is the default method for improving gen- 
eralization. 


Example 2.2: In order to use the early stopping technique, the available data is 
divided into three subsets: the training set, the validation set, and the test set. 
The error on the validation set is monitored during the 
training process. The validation error normally decreases during the initial phase 
of training, as does the error on the training set. However, when the network 
begins to overfit the data, the error on the validation set typically begins to rise. 
When the validation error increases for a specified number of iterations, training 
is stopped, and the weights and biases at the minimum of the validation error 
are returned. The error on the test set is not used during training, but it is used 
to compare different models. It is also useful to plot the error on the test set 
during the training process. If the error on the test set reaches a minimum at a 
significantly different iteration number than the error on the validation set, this 
might indicate a poor division of the data set. From Example 2.1, the 40 data 
samples are divided into training, validation, and test sets in the proportions 
60%, 20% and 20%. The relation is illustrated in Fig. 2.3. 
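
The data split and the stopping rule of this example can be sketched as follows. The sketch assumes a patience-style criterion (stop when the validation error has not improved on its best value for a given number of consecutive epochs), which is one common reading of "increases for a specified number of iterations"; the function names and the validation errors are made up for illustration.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices 0..n-1 and split them into training, validation, and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def early_stopping_epoch(val_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors.

    Training stops once the validation error has failed to improve on its best
    value for `patience` consecutive epochs; the weights saved at `best_epoch`
    would then be restored.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_errors) - 1, best_epoch


train_idx, val_idx, test_idx = split_indices(40)
val_errors = [0.90, 0.55, 0.38, 0.41, 0.45, 0.52, 0.60]  # made-up values
stop_epoch, best_epoch = early_stopping_epoch(val_errors, patience=3)
```

In a real training loop the validation error would be computed after each epoch and a copy of the weights kept whenever a new minimum is reached.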


Early stopping is implemented with crossvalidation to decide when to stop. 
Three early-stopping criteria are defined and empirically compared in [107]. 
Slower stopping criteria, which stop later than others, on average lead to small 
improvements in generalization, but result in a much longer training time [107]. 

Statistical analysis for the three-layer MLP has been performed in [4]. As far as 
the generalization performance is concerned, exhaustive learning is satisfactory 
when $N > 30 N_w$. When $N < N_w$, early stopping can really prevent overtraining. 
Figure 2.3 Learning and generalization. When the network is overtrained, its generalization 
performance degrades. (Plot title: Best Validation Performance is 0.3821 at epoch 2.) 




When $N < 30 N_w$, overtraining may also occur. In the latter two cases, 
crossvalidation can be used to stop training. 

An optimized approximation algorithm [88] avoids overfitting in function 
approximation applications. The algorithm utilizes a quantitative stopping cri- 
terion based on the estimation of the signal-to-noise-ratio figure (SNRF). Using 
SNRF, overfitting can be automatically detected from the training error only. 


Generalization by regularization 


Regularization is a reliable method for improving generalization. The target func- 
tion is assumed to be smooth, and small changes in the input do not cause large 
changes in the output. A constraint term $E_c$, which penalizes poor generalization, 
is added to the standard training cost function $E$ 

$$E_T = E + \lambda_c E_c, \tag{2.3}$$

where $\lambda_c$ is a positive value that balances the tradeoff between error minimization 
and smoothing. In contrast to the early-stopping criterion method, the regular- 
ization method is applicable to both the iterative gradient-based techniques and 
the one-step linear optimization such as the singular value decomposition (SVD) 
technique. 

Network-pruning techniques such as the weight-decay technique also help to 
improve generalization [109, 23]. At the end of training, there are some weights 
significantly different from zero, while some other weights are close to zero. Those 
connections with small weights can be removed from the network. Biases should 
be excluded from the penalty term so that the network yields an unbiased esti- 
mate of the true target mean. 
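
A weight-decay step with the biases left out of the penalty might look like the sketch below. It assumes the common quadratic penalty (half the sum of squared weights) as the constraint term, which the text does not specify; `lam` stands for the regularization coefficient, and the function names are hypothetical.

```python
def penalized_cost(error, weights, lam):
    """Total cost: training error plus lam times the penalty 0.5 * sum(w**2)."""
    return error + lam * 0.5 * sum(w * w for w in weights)


def weight_decay_step(weights, biases, grad_w, grad_b, eta=0.1, lam=0.01):
    """One gradient-descent step on the penalized cost.

    The penalty contributes lam * w to the gradient of each weight;
    biases are excluded from the penalty and follow the plain error gradient.
    """
    new_w = [w - eta * (g + lam * w) for w, g in zip(weights, grad_w)]
    new_b = [b - eta * g for b, g in zip(biases, grad_b)]
    return new_w, new_b


# With zero error gradient, only the decay acts: weights shrink, biases do not.
w, b = weight_decay_step([1.0, -2.0], [0.5], grad_w=[0.0, 0.0], grad_b=[0.0])
total = penalized_cost(1.0, [1.0, -2.0], lam=0.01)
```

Weights that the error gradient does not actively sustain decay toward zero, which is what makes the corresponding connections candidates for pruning.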

Early stopping has a behavior similar to that of a simple weight-decay 
technique in the case of the MSE function [22]. The quantity $\frac{1}{\eta t}$, where $\eta$ is the 
learning rate and $t$ is the iteration index, plays the role of $\lambda_c$. The effective 
number of weights, that is, the number of weights whose values differ significantly 
from zero, grows as training proceeds. 
Training with a small amount of jitter in the input while keeping the same 
output can improve generalization. With jitter, the learning problem is equivalent 
to a smoothing regularization with the noise variance playing the role of the 
regularization parameter [109, 22]. Training with jitter thus allows regularization 
within the conventional layered feedforward network architecture. Although large 
networks are generally trained rapidly, they tend to generalize poorly due to 
insufficient constraints. Training with jitter helps to prevent overfitting. 
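
Training with jitter can be sketched as a data-augmentation step that perturbs the inputs and leaves the targets unchanged. The Gaussian noise model, the noise level `sigma`, and the number of copies are illustrative assumptions; the function name is hypothetical.

```python
import random


def jittered_samples(inputs, targets, sigma=0.05, copies=3, seed=0):
    """Return (noisy_input, target) pairs: Gaussian jitter on inputs, targets unchanged."""
    rng = random.Random(seed)
    out = []
    for x, y in zip(inputs, targets):
        for _ in range(copies):
            noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
            out.append((noisy, y))
    return out


inputs = [[0.0, 1.0], [1.0, 0.0]]
targets = [1, 0]
augmented = jittered_samples(inputs, targets)
```

The variance `sigma ** 2` plays the role of the regularization parameter: larger jitter corresponds to stronger smoothing of the learned mapping.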

In [76], noise is added to the available training set to generate an unlimited 
source of training samples. This is interpreted as a kernel estimate of the proba- 
bility density that describes the training vector distribution. It helps to enhance 
the generalization performance, speed up the BP, and reduce the possibility of 
local minima entrapment. 

In [73], each weight is encoded with a short bit-length to decrease the com- 
plexity of the model. The amount of information in a weight can be controlled 
by adding Gaussian noise and the noise level can be adapted during learning to 
optimize the tradeoff between the expected squared error of the network and the 
amount of information in the weights. 

Weight sharing is to control several weights by a single parameter [120]. This 
reduces the number of free parameters in a network, and thus improves gen- 
eralization. A soft weight-sharing method is implemented in [99] by adding a 
regularization term to the error function, where the learning algorithm decides 
which of the weights should be tied together. 

Regularization decreases the representation capability of the network, and thus 
increases the bias (bias-variance dilemma [59]). The principle of regularization 
is to choose a well-defined regularizer to decrease the variance by affecting the 
bias as little as possible [22]. 


Fault tolerance and generalization 


Fault tolerance is strongly associated with generalization. Input noise during 
training improves generalization ability [23], and synaptic noise during training 
improves fault tolerance [95]. When fault tolerance is improved, the general- 
ization ability is usually better [52], and vice versa [42]. The lower the weight 
magnitude, the higher the fault tolerance [42], [20] and the generalization ability 
[83], [20]. Based on the Vapnik-Chervonenkis (VC) dimension, it is qualitatively 
explained in [103] why adding redundancy can improve fault tolerance and gen- 
eralization. 

Fault tolerance is related to a uniform distribution of the learning among the 
different neurons, but the BP algorithm does not guarantee this good distribution 
[52]. Just as input noise is introduced to enhance the generalization ability, the 
perturbation of weights during training also increases the fault tolerance of MLPs 
[95], [52]. Saliency, used as a measurement of fault tolerance to weight deviations 
[95], is computed from the diagonal elements of the Hessian matrix of the error 
with respect to the weight values. A low value of saliency implies a higher fault
tolerance. 

An analysis of the influence of weight and input perturbations in an MLP is 
made in [20]. The measurements introduced are explicitly related to the MSE 
degradation in the presence of perturbations, thus constituting a selection crite- 
rion between different alternatives of weight configurations. Quantitative mea- 
surements of fault tolerance, noise immunity, and generalization ability are pro- 
vided, from which several previous conjectures are deduced. The methodology 
is also applicable to the study of the tolerance to perturbations of other layered 
networks such as the RBF networks. 

When training the MLP with on-line node fault injection, hidden nodes randomly
output zeros during training. The trained MLP is able to tolerate random
node faults. The convergence of the algorithm is proved in [137]. The corresponding
objective functions consist of an MSE term, a regularizer term, and a weight
decay term.
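
A minimal sketch of the fault-injection step is given below. It only shows how hidden nodes can be forced to output zero at random during a forward pass; the tanh activation, the function name `hidden_outputs_with_faults`, and the parameter `fault_prob` are assumptions for illustration and are not taken from [137].

```python
import math
import random

def hidden_outputs_with_faults(x, hidden_weights, fault_prob, rng):
    """Compute tanh hidden activations, forcing each node to output zero
    with probability `fault_prob` (a simulated node fault)."""
    outputs = []
    for weights in hidden_weights:
        activation = math.tanh(sum(w * xi for w, xi in zip(weights, x)))
        if rng.random() < fault_prob:
            activation = 0.0  # this node is treated as faulty for this pattern
        outputs.append(activation)
    return outputs

rng = random.Random(0)
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
x = [1.0, 2.0]
healthy = hidden_outputs_with_faults(x, hidden_weights, 0.0, rng)     # no faults
all_faulty = hidden_outputs_with_faults(x, hidden_weights, 1.0, rng)  # every node fails
```

During training, the weight update would be computed from the faulty forward pass, so that the learned weights do not rely on any single hidden node.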

Six common fault/noise-injection-based online learning algorithms, namely, 
injecting additive input noise, injecting additive/multiplicative weight noise, 
injecting multiplicative node noise, injecting multiweight fault (random discon- 
nection of weights), injecting multinode fault during training, and weight decay 
with injecting multinode fault, are investigated in [74] for RBF networks. The 
convergence of the six online algorithms is shown to be almost sure, and their 
true objective functions being minimized are derived. For injecting additive input 
noise during training, the objective function is identical to that of the Tikhonov 
regularizer approach. For injecting additive/multiplicative weight noise during 
training, the objective function is the simple mean square training error; thus, 
injecting additive/multiplicative weight noise during training cannot improve the 
fault tolerance of an RBF network. Similar to injecting additive input noise, the
objective functions of other fault/noise-injection-based online algorithms contain
an MSE term and a specialized regularization term. 


Sparsity versus stability 


Stability establishes the generalization performance of an algorithm [26]. Sparsity 
and stability are two desired properties of learning algorithms. Both properties 
lead to good generalization ability. These two properties are fundamentally at 
odds with each other and this no-free-lunch theorem is proved in [153], [154]: A 
sparse algorithm cannot be stable and vice versa. A sparse algorithm can have 
nonunique optimal solutions and is therefore ill-posed. If an algorithm is sparse, 
then its uniform stability is lower bounded by a nonzero constant. This also 
shows that any algorithmically stable algorithm cannot be sparse. Thus, one has 
to trade off sparsity and stability in designing a learning algorithm. 

In [153], [154], L1-regularized regression (LASSO) is shown to be not stable,
while L2-regularized regression is known to have strong stability properties and
is therefore not sparse. Sparsity-promoting algorithms include LASSO, L1-norm
SVM, deep belief network, and sparse PCA.
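
The contrast between the two penalties can be seen in the simplest possible setting, a scalar least-squares problem with closed-form minimizers. The sketch below is an illustration under that assumption only; it is not the construction used in [153], [154], and the function names are hypothetical.

```python
def lasso_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small inputs are mapped exactly to zero (sparsity)

def ridge_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: uniform shrinkage."""
    return z / (1.0 + lam)

z_values = [-2.0, -0.3, 0.1, 0.8, 3.0]
lasso_w = [lasso_1d(z, 0.5) for z in z_values]
ridge_w = [ridge_1d(z, 0.5) for z in z_values]
```

The L1 solution sets the two small coefficients exactly to zero, whereas the L2 solution shrinks every coefficient by the same factor and changes smoothly with the data, which is the behavior associated with stability.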


Model selection 


Occam’s razor was formulated by William of Occam in the late Middle Ages. 
Occam’s razor principle states: “No more things should be presumed to exist 
than are absolutely necessary.” That is, if two models of different complexity fit 
the data approximately equally well, the simpler one usually is a better predictive 
model. From models approximating the noisy data, the ones that have minimal 
complexity should be chosen. 

The objective of model selection is to find a model that is as simple as possible 
that fits a given data set with sufficient accuracy, and has a good generalization 
capability to unseen data. The generalization performance of a network gives 
a measure of the quality of the chosen model. Model-selection approaches can 
be generally grouped into four categories: crossvalidation, complexity criteria, 
regularization, and network pruning/growing. 

The generalization error of a learning method can be estimated via either cross- 
validation or bootstrap. In crossvalidation methods, many networks of different 
complexity are trained and then tested on an independent validation set. The 
procedure is computationally demanding and/or requires additional data withheld
from the total pattern set. In complexity criterion-based methods, many networks
must also be trained, which is computationally demanding, though a
validation set is not required. Regularization methods are more efficient than
crossvalidation techniques, but the results may be suboptimal since the penalty 
terms damage the representation capability of the network. Pruning/growing 
methods can be under the framework of regularization, which often makes restric- 
tive assumptions, resulting in networks that are suboptimal. 


Crossvalidation 


Crossvalidation is a standard model-selection method in statistics [78]. The total 
pattern set is randomly partitioned into a training set and a validation (test) set. 
The major part of the total pattern set is included in the training set, which is 
used to train the network. The remaining, typically, 10 to 20 per cent, is included 
in the validation set and is used for validation. When only one sample is used 
for validation, the method is called leave-one-out crossvalidation. Methods on 
conducting crossvalidation are given in [107]. This kind of hold-out estimate of
performance lacks computational efficiency due to the repeated training, but
it has a lower variance of the estimate.

Let $D_i$ and $\bar{D}_i$, $i = 1, \ldots, m$, be the data subsets of the total pattern set arising
from the $i$th partitioning, which are, respectively, used for training and testing.
The crossvalidation process trains the algorithm $m$ times, and is actually to find
a suitable model by minimizing the negative log-likelihood function

$$E_{cv} = -\frac{1}{m} \sum_{i=1}^{m} \ln L\left(\hat{W}(D_i) \mid \bar{D}_i\right), \qquad (2.4)$$

where $\hat{W}(D_i)$ denotes the maximum-likelihood (ML) parameter estimates on
$D_i$, and $L(\hat{W}(D_i) \mid \bar{D}_i)$ is the likelihood evaluated on the data set $\bar{D}_i$.

Validation uses data different from the training set, thus the validation set is 
independent from the estimated model. This helps to select the best one among 
the different model parameters. Since this data set is independent from the esti- 
mated model, the generalization error obtained is a fair estimate. Sometimes it 
is not optimal if we train the network to perfection on a given pattern set due 
to the ill-posedness of the finite training pattern set. Crossvalidation helps to 
generate good generalization of the network, when N, the size of the training
set, is too large. Crossvalidation is effective for finding a large network with a 
good generalization performance. 

The popular K-fold crossvalidation [134] employs a nonoverlapping test set 
selection scheme. The data universe D is divided into K nonoverlapping data 
subsets of the same size. Each data subset is then used as a test set, with the 
remaining K - 1 folds acting as a training set, and an error value is calculated
by testing the classifier on the held-out fold. Finally, the K-fold crossvalidation
estimation of the error is the average value of the errors committed in each fold. 
Thus, the K-fold crossvalidation error estimator depends on two factors: the 
training set and the partitioning into folds. Estimating the variance of K-fold 
crossvalidation can be done from independent realizations or from dependent 
realizations whose correlation is known. K-fold crossvalidation produces depen- 
dent test errors. Consequently, there is no universal unbiased estimator of the 
variance of K-fold crossvalidation that is valid under all distributions [19]. 
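
The K-fold procedure described above can be sketched in a few lines. The helper names (`k_fold_indices`, `k_fold_error`), the trivial mean predictor, and the squared-error loss are assumptions chosen to keep the example self-contained; any learning algorithm and loss could be substituted.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Randomly partition range(n_samples) into k nonoverlapping folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def k_fold_error(xs, ys, k, fit, loss, rng):
    """Average the test error over k folds; each fold is held out once."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [loss(model(xs[i]), ys[i]) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k, fold_errors

def fit_mean(train_x, train_y):
    """Toy learner: always predict the mean training target."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y

def squared_error(prediction, target):
    return (prediction - target) ** 2

rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [2.0 for _ in xs]  # constant targets, so the mean predictor is exact
cv_error, per_fold = k_fold_error(xs, ys, 5, fit_mean, squared_error, rng)
```

The per-fold errors returned here are the dependent test errors mentioned in the text: every pair of training sets shares most of its samples.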

The variance estimators of the K-fold crossvalidation estimator of the gen- 
eralization error presented in [92] are almost unbiased in the cases of smooth 
loss functions and the absolute error loss. The problem of variance estimation 
is approached as a problem in approximating the moments of a statistic. The 
estimators depend on the distribution of the errors and on the knowledge of the 
learning algorithm. Overall, a test set that uses 25% of the available data seems
to be a reasonable compromise in selecting among the various forms of K-fold 
crossvalidation [92]. 

The leave-many-out variants of crossvalidation perform better than the leave- 
one-out versions [105]. Empirically, both types of crossvalidation can exhibit high 
variance in small samples, but this may be alleviated by increasing the level of 
resampling. Used appropriately, leave-many-out crossvalidation is, in general, 
more robust than leave-one-out crossvalidation [105]. 

The moment approximation estimator [92] performs better in terms of both 
the variance and the bias than the Nadeau-Bengio estimator [96]. The latter is 
computationally simpler than the former for general loss functions, as it does 
not require the computation of the derivatives of the loss function; but it is not
an appropriate one to be used for non-random test set selection. 

Crossvalidation and bootstrapping are both resampling methods. Resampling 
varies the training set numerous times based on one set of available data. One 
fundamental difference between crossvalidation and bootstrapping is that boot- 
strapping resamples the available data at random with replacement, whereas 
crossvalidation resamples the available data at random without replacement. 
Crossvalidation methods never evaluate the trained networks over examples that 
appear in the training set, whereas bootstrapping methods typically do that. 
Crossvalidation methods split the data such that a sample does not appear in 
more than one validation set. Crossvalidation is commonly used for estimating 
generalization error, whereas bootstrapping finds widespread use in estimating 
error bars and confidence intervals. Crossvalidation is commonly believed to be 
more accurate (less biased) than bootstrapping, but to have a higher variance 
than bootstrapping does in small samples [105]. 
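
The difference between the two resampling schemes can be made concrete with a small sketch of a bootstrap draw. The function name `bootstrap_split` and the sample size are assumptions for illustration; the out-of-bag list simply collects the indices that were never drawn.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return the draw and the
    indices that were never selected (the out-of-bag examples)."""
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(train)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return train, out_of_bag

rng = random.Random(0)
train_idx, oob_idx = bootstrap_split(1000, rng)
# Fraction of distinct examples in the bootstrap sample; for large sets
# this is close to 1 - 1/e, i.e. roughly 63 per cent.
unique_fraction = len(set(train_idx)) / 1000
```

Because sampling is with replacement, the bootstrap training set contains repeated examples, whereas a crossvalidation partition contains each example exactly once.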


Complexity criteria 


An efficient approach for improving the generalization performance is to con- 
struct a small network using a parsimonious principle. Statistical model selec- 
tion with information criteria such as Akaike’s final prediction error criterion [1], 
Akaike information criterion (AIC) [3], Schwarz’s Bayesian information criterion
(BIC) [126], and Rissanen’s minimum description length (MDL) principle
[113] are popular and have been widely used for model selection of neural net- 
works. Although the motivations and approaches for these criteria may be very 
different from one another, most of them can be expressed as a function with two 
components, one for measuring the training error and the other for penalizing 
the complexity. These criteria penalize large-size models. 

A possible approach to model order selection consists of minimizing the 
Kullback-Leibler discrepancy between the true pdf of the data and the pdf (or 
likelihood) of the model, or equivalently maximizing the relative Kullback-Leibler
information. Maximizing the asymptotic approximation of the relative Kullback-Leibler
information with n, the number of variables, is equivalent to minimizing the AIC 
function of n. AIC is derived by maximizing an asymptotically unbiased estimate 
of the relative Kullback-Leibler information J. The BIC rule can be derived from 
an asymptotically unbiased estimate of the relative Kullback-Leibler information 
[133]. BIC is the penalized ML method. 

The AIC and BIC criteria can be, respectively, represented by

$$E_{AIC} = -\frac{1}{N} \ln L_N(\hat{W}_N) + \frac{N_P}{N}, \qquad (2.5)$$

$$E_{BIC} = -\frac{1}{N} \ln L_N(\hat{W}_N) + \frac{N_P}{2N} \ln N, \qquad (2.6)$$

where $L_N(\hat{W}_N)$ is the likelihood estimated for a training set of size $N$ and
model parameters $\hat{W}_N$, and $N_P$ is the number of parameters in the model.
More specifically, the two criteria can be expressed by [133]

$$\mathrm{AIC}(N_P) = R_{emp}(N_P) + \frac{2 N_P}{N} \hat{\sigma}^2, \qquad (2.7)$$

$$\mathrm{BIC}(N_P) = R_{emp}(N_P) + \hat{\sigma}^2 \frac{N_P}{N} \ln N, \qquad (2.8)$$

where $\hat{\sigma}^2$ denotes an estimate of noise variance, and the empirical risk is given
by

$$R_{emp}(N_P) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(x_i, N_P)\right)^2, \qquad (2.9)$$

and the noise variance can be estimated, for a linear estimator with $N_P$ parameters, as

$$\hat{\sigma}^2 = \frac{N}{N - N_P} \, \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2. \qquad (2.10)$$

This leads to the following form of AIC known as final prediction error [2]:

$$\mathrm{FPE}(N_P) = \frac{1 + \frac{N_P}{N}}{1 - \frac{N_P}{N}} R_{emp}(N_P). \qquad (2.11)$$
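
The relation between these criteria can be checked numerically. The sketch below assumes the forms AIC = R_emp + (2 N_P / N) sigma^2, BIC = R_emp + sigma^2 (N_P / N) ln N, sigma^2 = N / (N - N_P) times R_emp, and FPE = (1 + N_P/N) / (1 - N_P/N) times R_emp; the function names and the toy data are hypothetical.

```python
import math

def empirical_risk(targets, predictions):
    """Mean squared residual over the training set."""
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n

def noise_variance(targets, predictions, n_params):
    """Noise variance estimate for a linear estimator with n_params parameters."""
    n = len(targets)
    return n / (n - n_params) * empirical_risk(targets, predictions)

def aic(r_emp, sigma2, n, n_params):
    return r_emp + 2.0 * n_params / n * sigma2

def bic(r_emp, sigma2, n, n_params):
    return r_emp + sigma2 * n_params / n * math.log(n)

def fpe(r_emp, n, n_params):
    ratio = n_params / n
    return (1.0 + ratio) / (1.0 - ratio) * r_emp

# Toy residuals of a two-parameter (straight-line) model on eight points.
targets = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0]
predictions = [1.1, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
n, n_params = len(targets), 2
r_emp = empirical_risk(targets, predictions)
sigma2 = noise_variance(targets, predictions, n_params)
scores = {
    "AIC": aic(r_emp, sigma2, n, n_params),
    "BIC": bic(r_emp, sigma2, n, n_params),
    "FPE": fpe(r_emp, n, n_params),
}
```

With the noise variance estimated from the same residuals, the AIC and FPE values coincide, while BIC applies the heavier penalty whenever ln N exceeds 2.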


The MDL principle stems from coding theory to find as short a description as 
possible of a database with as few symbols as possible [113, 114]. The descrip- 
tion length of the model characterizes the information needed for simultaneously 
encoding a description of the model and a description of the prediction errors of 
the model. The best model is the one with the minimum description length. The 
total description length $E_{MDL}$ has three terms: code cost for coding the input vectors,
model cost for defining the reconstruction method, and reconstruction error
due to reconstruction of the input vector from its code. The description length is 
described by the number of bits. Existing unsupervised learning algorithms such 
as the competitive learning and PCA can be explained using the MDL principle 
[73]. Good generalization can be achieved by encoding the weights with short 
bit-lengths by penalizing the amount of information they contain using the MDL 
principle [73]. The MDL measure can be regarded as an approximation of the 
Bayesian measure, and thus has a Bayesian interpretation. BIC rule has also been 
obtained by an approach based on coding arguments and the MDL principle. 

Generalization error Err is characterized by the sum of the training (approx- 
imation) error err and the degree of optimism OP inherent in a particular esti- 
mate [61], that is, Err = err + OP. Complexity criteria such as BIC can be used 
for estimating OP. 


Figure 2.4 Bias and variance. Circles denote examples from a training set.


2.4 Bias and variance


The generalization error can be represented by the sum of the bias squared plus 
the variance [59]. Most existing supervised learning algorithms suffer from the 
bias-variance dilemma [59]. That is, the requirements for small bias and small 
variance are conflicting and a tradeoff must be made. 

Let $f(x; \hat{w})$ be the best model in model space. Thus, $\hat{w}$ does not depend on
the training data. The bias and variance can be defined by [22]

$$\mathrm{bias} = E_S(f(x)) - f(x; \hat{w}), \qquad (2.12)$$

$$\mathrm{var} = E_S\left(\left(f(x) - E_S(f(x))\right)^2\right), \qquad (2.13)$$

where $f(x)$ is the function to be estimated, and $E_S$ denotes the expectation
operation over all possible training sets. Bias is caused by an inappropriate choice 
of the size of a class of models when the number of training samples is assumed 
infinite, while the variance is the error caused by the finite number of training 
samples. 


Example 2.3: An illustration of the concepts of bias and variance in the two-dimensional
space is shown in Fig. 2.4. $f(x; \hat{w})$ is the underlying function; $f_1(x)$
and $f_2(x)$ are used to approximate $f(x; \hat{w})$: $f_1(x)$ is an exact interpolation of
the data points, while $f_2(x)$ is a fixed function independent of the data points.
For $f_1(x)$, the bias is zero at the data points and is small in the neighborhood
of the data points, while the variance is the variance of the noise on the data,
which could be significant; for $f_2(x)$, the bias is high while the variance is zero.
The generalization error can be decomposed into a sum of the squared bias and the variance

$$\begin{aligned}
E_S\left([f(x) - f(x; \hat{w})]^2\right)
&= E_S\left([f(x) - E_S(f(x))]^2\right) + E_S\left([E_S(f(x)) - f(x; \hat{w})]^2\right) \\
&\quad + 2 E_S\left([f(x) - E_S(f(x))]\,[E_S(f(x)) - f(x; \hat{w})]\right) \\
&= (\mathrm{Bias})^2 + \mathrm{Var}. \qquad (2.14)
\end{aligned}$$

The cross term vanishes because $E_S(f(x)) - f(x; \hat{w})$ does not depend on the training
set and $E_S\left(f(x) - E_S(f(x))\right) = 0$.
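
The decomposition can be verified numerically by simulating many training sets. In the sketch below, a deliberately biased estimator (a shrunken sample mean) is evaluated at a single input; the constant `true_value` stands in for the best model output, and the names `simulate_estimates` and `decompose`, the shrinkage factor, and the noise level are all assumptions made for the example.

```python
import random

def simulate_estimates(true_value, shrink, n_points, noise_std, n_sets, rng):
    """Return one estimate per simulated training set.

    Each training set holds n_points noisy observations of true_value;
    the estimate is the sample mean multiplied by `shrink`.
    """
    estimates = []
    for _ in range(n_sets):
        sample = [true_value + rng.gauss(0.0, noise_std) for _ in range(n_points)]
        estimates.append(shrink * sum(sample) / n_points)
    return estimates

def decompose(estimates, true_value):
    """Empirical bias, variance, and mean squared error over training sets."""
    m = len(estimates)
    mean_estimate = sum(estimates) / m
    bias = mean_estimate - true_value
    var = sum((e - mean_estimate) ** 2 for e in estimates) / m
    mse = sum((e - true_value) ** 2 for e in estimates) / m
    return bias, var, mse

rng = random.Random(42)
estimates = simulate_estimates(2.0, 0.8, 10, 1.0, 5000, rng)
bias, var, mse = decompose(estimates, 2.0)
gap = mse - (bias ** 2 + var)  # zero up to floating-point rounding
```

Setting `shrink` to 1 removes the bias but leaves the variance at its largest value, while smaller values trade variance for bias.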


A network with a small number of adjustable parameters gives poor general- 
ization on new data, since the model has very little flexibility and thus yields 
underfitting with a high bias and low variance. In contrast, a network with 
too many adjustable parameters also gives a poor generalization performance, 
since it is too flexible and fits too much of the noise on the training data, thus 
yielding overfitting with a low bias but high variance. The best generalization 
performance is achieved by balancing bias and variance, which optimizes the 
complexity of the model through either finding a model with an optimal size 
or by adding a regularization term in an objective function. For nonparametric 
methods, most complexity criteria based techniques operate on the variance term 
in order to get a good compromise between the contributions made by the bias 
and variance to the error. When the number of hidden cells is increased, the bias 
term is likely to be reduced, whereas the variance would increase. 

For three-layer feedforward networks with $N_P$ hidden sigmoidal units, the bias
and variance are upper bounded explicitly [10] by $O(1/N_P)$ and $O(N_P J_1 \ln N / N)$,
respectively, where $N$ is the size of the training set and $J_1$ is the dimensionality
of the feature vectors. Thus when Np is large, the bias is small. However, when 
N is finite, a network with an excessively large space complexity will overfit 
the training set. The average performance can decrease as Np gets larger. As a 
result, a tradeoff needs to be made between the bias and variance. 

While unbiasedness is a beneficial quality of a model selection criterion, a 
low variance is at least as important, as a nonnegligible variance introduces 
the potential for overfitting in model selection as well as in training the model 
[33]. The effects of this form of overfitting are often comparable to differences 
in performance between learning algorithms [33]. This could be ameliorated by 
regularization of the model selection criterion [32]. 


Robust learning


When the training data is corrupted by large noise, such as outliers, conventional
learning algorithms may not yield acceptable performance since a small
number of outliers have a large impact on the MSE. An outlier is an observation
that deviates significantly from the other observations; this may be due
to erroneous measurements or noisy data from the tail of the noise distribution
functions. When noise becomes large or outliers exist, the networks may
try to fit those improper data and thus, the learned systems are corrupted. The
Student-t distribution has heavier tails than the Gaussian distribution and is
therefore less sensitive to any departure of the empirical distribution from Gaussianity.
For nonlinear regression, the techniques of robust statistics [77] can be
applied to deal with the outliers. The M-estimator is derived from the ML estimator
to deal with situations where the exact probability model is unknown.
The M-estimator replaces the conventional squared error term by the so-called
loss functions. The loss function is used to degrade the effects of those outliers
in learning. A difficulty is the selection of the scale estimator of the loss function
in the M-estimator.
The cost function of a robust learning algorithm is defined by

$$E_R = \sum_{i=1}^{N} \sigma(\epsilon_i; \beta), \qquad (2.15)$$

where $\sigma(\cdot)$ is the loss function, which is a symmetric function with a unique
minimum at zero, $\beta > 0$ is the scale estimator, known as the cutoff parameter,
$\epsilon_i$ is the estimated error for the $i$th training pattern, and $N$ is the size of the
training set. The loss function can be typically selected as one of the following
functions:


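• The logistic function [77]

$$\sigma(\epsilon_i; \beta) = \frac{\beta}{2} \ln\left(1 + \frac{\epsilon_i^2}{\beta}\right). \qquad (2.16)$$

• Huber’s function [77]

$$\sigma(\epsilon_i; \beta) = \begin{cases} \frac{1}{2}\epsilon_i^2, & |\epsilon_i| \le \beta \\ \beta|\epsilon_i| - \frac{1}{2}\beta^2, & |\epsilon_i| > \beta \end{cases}. \qquad (2.17)$$

• Talwar’s function [43]

$$\sigma(\epsilon_i; \beta) = \begin{cases} \frac{1}{2}\epsilon_i^2, & |\epsilon_i| \le \beta \\ \frac{1}{2}\beta^2, & |\epsilon_i| > \beta \end{cases}. \qquad (2.18)$$

• Hampel’s tanh estimator [35]

$$\sigma(\epsilon_i; \beta_1, \beta_2) = \begin{cases}
\frac{1}{2}\epsilon_i^2, & |\epsilon_i| \le \beta_1 \\
\frac{1}{2}\beta_1^2 - \frac{c_1}{c_2} \ln \frac{1 + e^{2 c_2 (\beta_2 - |\epsilon_i|)}}{1 + e^{2 c_2 (\beta_2 - \beta_1)}} - c_1 (|\epsilon_i| - \beta_1), & \beta_1 < |\epsilon_i| \le \beta_2 \\
\frac{1}{2}\beta_1^2 - \frac{c_1}{c_2} \ln \frac{2}{1 + e^{2 c_2 (\beta_2 - \beta_1)}} - c_1 (\beta_2 - \beta_1), & |\epsilon_i| > \beta_2
\end{cases}. \qquad (2.19)$$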


In the tanh estimator, $\beta_1$ and $\beta_2$ are two cutoff points, and constants $c_1$ and
$c_2$ adjust the shape of the influence function (to be defined in (2.21)). When
$c_1 = \beta_1 / \tanh\left(c_2(\beta_2 - \beta_1)\right)$, the influence function is continuous. In the interval of the
two cutoff points, the influence function can be represented by a hyperbolic
tangent relation.

Figure 2.5 Loss functions and their respective influence functions. For logistic, Huber’s, and Talwar’s
functions, $\beta = 1$. For Hampel’s tanh estimator, $\beta_1 = 1$, $\beta_2 = 2$, $c_2 = 1$, and $c_1 = 1.313$. (a) Loss
functions $\sigma$. (b) Influence functions $\phi$.

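Using the gradient-descent method, the weights are updated by

$$\Delta w_{jk} = -\eta \frac{\partial E_R}{\partial w_{jk}} = -\eta \sum_{i=1}^{N} \phi(\epsilon_i; \beta) \frac{\partial \epsilon_i}{\partial w_{jk}}, \qquad (2.20)$$

where $\eta$ is a learning rate or step size, and $\phi(\cdot)$, called the influence function, is
given by

$$\phi(\epsilon_i; \beta) = \frac{\partial \sigma(\epsilon_i; \beta)}{\partial \epsilon_i}. \qquad (2.21)$$

The conventional MSE function corresponds to $\sigma(\epsilon_i) = \frac{1}{2}\epsilon_i^2$ and $\phi(\epsilon_i; \beta) = \epsilon_i$.
To suppress the effect of large errors, loss functions used for robust learning
are defined such that $\phi(\epsilon_i; \beta)$ is sublinear.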
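
The relation between a loss function and its influence function can be illustrated with three of the losses listed above. This is a sketch under stated assumptions: the logistic loss is taken in the form (beta/2) ln(1 + e^2/beta), the function names are hypothetical, and the finite-difference helper is included only to check that each influence function is the derivative of its loss.

```python
import math

def huber_loss(e, beta):
    if abs(e) <= beta:
        return 0.5 * e * e
    return beta * abs(e) - 0.5 * beta * beta

def huber_influence(e, beta):
    # Linear inside the cutoff, clipped to +/- beta outside it.
    return e if abs(e) <= beta else math.copysign(beta, e)

def talwar_loss(e, beta):
    return 0.5 * e * e if abs(e) <= beta else 0.5 * beta * beta

def talwar_influence(e, beta):
    # Errors beyond the cutoff are ignored completely.
    return e if abs(e) <= beta else 0.0

def logistic_loss(e, beta):
    return 0.5 * beta * math.log(1.0 + e * e / beta)

def logistic_influence(e, beta):
    return e / (1.0 + e * e / beta)

def numeric_derivative(loss, e, beta, h=1e-6):
    """Central finite difference of the loss with respect to the error."""
    return (loss(e + h, beta) - loss(e - h, beta)) / (2.0 * h)

beta = 1.0
large_error = 5.0
influences = {
    "mse": large_error,  # squared error: influence grows linearly
    "huber": huber_influence(large_error, beta),
    "talwar": talwar_influence(large_error, beta),
    "logistic": logistic_influence(large_error, beta),
}
```

For the large error used here, every robust influence value is much smaller than the linear influence of the squared error, which is how the outlier's contribution to the weight update is suppressed.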


Example 2.4: The loss functions given above and their respective influence func- 
tions are illustrated in Fig. 2.5. 


The $\tau$-estimator [138] can be viewed as an M-estimator with an adaptive bounded
influence function $\phi(\cdot)$ given by the weighted average of two functions $\phi_1(\cdot)$ and
$\phi_2(\cdot)$, with $\phi_1(\cdot)$ corresponding to a very robust estimate and $\phi_2(\cdot)$ to a highly
efficient estimate. The $\tau$-estimator simultaneously has a high breakdown point and a
high efficiency under Gaussian errors.

Figure 2.6 Architecture of a neural network processor.

When the initial weights are not properly selected, the loss functions may not
be able to correctly discriminate against the outliers. The selection of $\beta$ is also a
problem, and one approach is to select $\beta$ as the median of the absolute deviation
(MAD)

$$\beta = c \times \mathrm{median}\left(|\epsilon_i - \mathrm{median}(\epsilon_i)|\right) \qquad (2.22)$$

with $c$ chosen as 1.4826 [77]. Some other methods for selecting $\beta$ are based on
using the median of all errors [77], or counting out a fixed percentage of points
as outliers [35].
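
The MAD rule is straightforward to compute. The sketch below uses the standard-library `statistics.median`; the function name `mad_scale` and the sample errors are assumptions for illustration.

```python
import statistics

def mad_scale(errors, c=1.4826):
    """Scale estimate: c times the median absolute deviation from the median."""
    center = statistics.median(errors)
    deviations = [abs(e - center) for e in errors]
    return c * statistics.median(deviations)

# Five small errors and one gross outlier.
errors = [0.1, -0.2, 0.05, 0.3, -0.1, 8.0]
beta = mad_scale(errors)
```

The single outlier barely affects the resulting cutoff, whereas a scale estimate based on the standard deviation of the same errors would be dominated by it.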


2.6 Neural network processors 


A typical architecture of a neural network processor is illustrated in Fig. 2.6.
It is composed of three components: input preprocessing, a neural network for
performing inversion, and output postprocessing. Input preprocessing is used
to remove redundant and/or irrelevant information in order to achieve a small
network and to reduce the dimensionality of the signal parameter space, thus
improving the generalization capability of the network. Postprocessing the output
of the network generates the desired information.


Preprocessing transforms the raw data into a new representation before it is
presented to a neural network. If the input data is preprocessed at the
training stage, then at the generalization stage the input data also needs
to be preprocessed before being passed on to the neural network. Similarly, if
the output data is preprocessed at the training stage, the network output at the
generalization stage is also required to be postprocessed to generate the target
output corresponding to the raw output patterns.
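
The requirement that the same transformations be reused at the generalization stage can be sketched with a simple min-max scaler. The scaling scheme, the function names, and the identity `network` stand-in are assumptions for illustration; the point is only that the parameters are fitted once on the training data and then reused.

```python
def fit_minmax(values):
    """Record the range of a training variable for later scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature cannot be min-max scaled")
    return lo, hi

def scale(value, params):
    lo, hi = params
    return (value - lo) / (hi - lo)

def unscale(value, params):
    lo, hi = params
    return value * (hi - lo) + lo

# Training stage: fit the input and output transformations once.
train_inputs = [10.0, 20.0, 40.0]
train_targets = [100.0, 200.0, 400.0]
in_params = fit_minmax(train_inputs)
out_params = fit_minmax(train_targets)

def network(z):
    """Stand-in for a trained network that operates in the scaled space."""
    return z

# Generalization stage: preprocess the input with the stored parameters,
# then postprocess the network output back to the raw output range.
new_input = 25.0
prediction = unscale(network(scale(new_input, in_params)), out_params)
```

Refitting the scaler on new data at the generalization stage would silently change the mapping the network was trained on, so the stored training-stage parameters are used in both directions.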


For high-dimensional data, dimensionality reduction is the key to cope with the
curse of dimensionality. Preprocessing has a significant influence on the generalization
performance of a neural network. This process removes the redundancy
in the input space and reduces the space of the input data, thus usually resulting
in a reduction in the amount or the dimensionality of the input data. This helps
to alleviate the problem of the curse of dimensionality. A network with preprocessed
inputs may be constrained by a smaller data set, and thus one needs only
to train a small network, which also achieves a better generalization capability.

Preprocessing usually takes the form of linear or nonlinear transformation of
the raw input data to generate input data for the network. It can also be based
on the prior knowledge of the network architecture, or the problem itself. When
preprocessing removes redundant information in the input data, it also results
in a loss of information. Thus, preprocessing should retain as much relevant
information as possible.

The data are sometimes called features, and preprocessing for input raw data
can be either feature selection or feature extraction. Feature selection concen- 
trates on selecting from the original set of features a smaller subset of salient 
features, while feature extraction is to combine the original features in such a 
way as to produce a new reduced set of salient features. The raw data may be 
orders of magnitude in range, and linear scaling and data whitening of the raw 
data are usually employed as a preprocessing step. 

When some examples of the raw data suffer from missing components, one 
simple treatment is to discard those examples from the dataset. This, however, 
is applicable only when the data is abundant, the percentage of examples with 
missing components is small, and the mechanism for loss of data is independent 
of the data itself. Otherwise, discarding examples may lead to a biased subset of the data.
Alternatively, the missing value can be replaced with a value substituted according to various
criteria. For function approximation problems, one can represent any variable
with a missing value as a regression over the other variables using the available 
data, and then find the missing value by interpolating the regression function. 
For density estimation problems, the ML solution to problems with the missing 
data can be found by applying an expectation-maximization (EM) algorithm. 

Feature extraction reduces the dimension of the features by orthogonal transforms.
The extracted features do not have any physical meaning. In comparison,
feature selection decreases the size of the feature set or reduces the dimension of
the features by discarding the raw information according to a criterion.


Feature selection 

Feature selection is to select the best subset or the best subspace of the features 
out of the original set, since irrelevant features degrade the performance. A cri- 
terion is required to evaluate each subset of the features so that an optimum 
subset can be selected. The selection criterion should be the same as that for 
assessing the complete system, such as the MSE criterion for function approxi- 
mation and the misclassification rate for classification. Theoretically, the global 
optimum subset of the features can only be selected by an exhaustive search of 
all the possible subsets of the features. 

Feature selection algorithms can be categorized as either filter or wrapper 
approaches. During the process of feature selection, the generalization ability of 
a subset of features needs to be estimated. This type of feature selection is called 
a wrapper method [79]. The problem of searching the best r variables is solved by 
means of a greedy algorithm based on backward selection [79]. The filter approach 
basically pre-selects the features, and then applies the selected feature subset to 
the clustering algorithm. Filter-based greedy algorithms using the sequential 
selection of the feature with the best criterion value are computationally more 
efficient than wrappers. In general, the wrapper method outperforms the filter 
method, but at the expense of training a large number of classifiers. 

Some nonexhaustive search methods such as the branch and bound procedure, 
sequential forward selection, and sequential backward elimination are discussed 
in [22]. Usually, backward selection is slower but is more stable in selecting
optimal features than forward selection. Backward selection starts from all the 
features and deletes one feature at a time, which deteriorates the selection crite- 
rion the least, until the selection criterion reaches a specified value. In contrast, 
forward selection starts from an empty set of features and adds one feature at a 
time that improves the selection criterion the most. 
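
Forward selection can be sketched as a greedy loop over candidate features. In the sketch below the selection criterion is assumed to be a score to maximize, and the toy criterion (a sum of per-feature usefulness values minus a size penalty) is purely hypothetical; in a wrapper method it would be replaced by an estimate of generalization performance.

```python
def forward_selection(features, criterion, max_features):
    """Greedily add the feature that improves the criterion the most.

    Stops when no remaining feature improves the score or when
    max_features have been selected.
    """
    selected = []
    remaining = list(features)
    best_score = criterion(selected)
    while remaining and len(selected) < max_features:
        scored = [(criterion(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the criterion
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical per-feature usefulness; "c" is irrelevant and "d" is harmful.
usefulness = {"a": 0.5, "b": 0.3, "c": 0.0, "d": -0.1}

def toy_criterion(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)

chosen, score = forward_selection(list(usefulness), toy_criterion, 4)
```

Backward selection follows the same pattern in reverse: it starts from the full set and repeatedly removes the feature whose deletion degrades the criterion the least.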

Mutual information based feature selection is a common method for feature 
selection [15, 54]. The mutual information measures the arbitrary dependence 
between random variables, whereas methods that assume linear relations, such as
correlation-based methods, are prone to mistakes. By calculating the mutual information, the
importance levels of the features are ranked based on their ability to maximize 
the evaluation criterion. Relevant inputs are found by estimating the mutual 
information between the inputs and the desired outputs. The normalized mutual 
information feature selection [54] does not require a user-defined parameter. 
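
The difference between mutual information and a correlation-based measure can be seen on a small discrete example. The sketch below estimates mutual information from relative frequencies; it is an illustration only and is not the normalized method of [54]. The function names and the toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences,
    estimated from relative frequencies."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * ln( p(x,y) / (p(x) p(y)) ), written with raw counts
        total += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return total

def covariance(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# The target is the square of the feature: fully dependent, yet uncorrelated.
feature = [-1, 0, 1, -1, 0, 1]
target = [x * x for x in feature]
mi_nonlinear = mutual_information(feature, target)
cov_nonlinear = covariance(feature, target)

# An input unrelated to the labels carries no information about them.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]
mi_unrelated = mutual_information(unrelated, labels)
```

A correlation-based ranking would discard the first feature because its covariance with the target is zero, while the mutual information correctly identifies it as relevant.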


Feature extraction 

Feature extraction is usually conducted by using orthogonal transforms, though 
the Gram-Schmidt orthonormalization (GSO) is more suitable for feature selec- 
tion. This is due to the fact that the physically meaningless features in the 
Gram-Schmidt space can be linked back to the same number of variables of the 
measurement space, resulting in no dimensionality reduction. In situations where 
the features are used for pattern understanding and analysis, the GSO transform 
provides a good option. 

The advantage of employing an orthogonal transform is that the correlations 
among the candidate features are decomposed so that the significance of the indi- 
vidual features can be evaluated independently. PCA is a well-known orthogonal 
transform. Taking all the data into account, PCA computes vectors that have 
the largest variance associated with them. The generated PCA features may 
not have clear physical meanings. Dimensionality reduction is achieved by drop- 
ping the variables with insignificant variance. Projection pursuit [57] is a general 
approach to feature extraction, which extracts features by repeatedly choosing 
projection vectors and then orthogonalizing. 
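
The way PCA finds the direction of largest variance can be sketched with power iteration on the sample covariance matrix. This is one simple way to obtain the leading component and is used here only as an illustration; the function names and the toy data are assumptions, and practical implementations usually rely on an eigenvalue or singular value decomposition.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def leading_component(cov, n_iter=200):
    """Power iteration: returns the unit vector of largest variance and
    the variance (eigenvalue) along it."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue

# Two strongly correlated variables: nearly all variance lies along one line.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
cov = covariance_matrix(data)
direction, variance = leading_component(cov)
explained = variance / (cov[0][0] + cov[1][1])  # fraction of total variance
```

Since the first component carries almost all of the variance in this example, the second can be dropped with little loss, which is the dimensionality reduction described above.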

PCA is often used to select inputs, but it is not always useful, since the variance 
of a signal is not always related to the importance of the variable, for example, 
for non-Gaussian signals. An improvement on PCA is provided by nonlinear gen- 
eralizations of PCA, which extend the ability of PCA to incorporate nonlinear 
relationships in the data. ICA can extract the statistically independent compo- 
nents from the input data set. It is to estimate the mutual information between 
the signals by adjusting the estimated matrix to give outputs that are maximally 
independent [8]. The dimensions to remove are those that are independent of the 
output. LDA searches for those vectors in the underlying space that best dis- 
criminate among the classes (rather than those that best describe the data). In 
[84], the proposed scheme for linear feature extraction in classification is based 
on the maximization of the mutual information between the features extracted 
and the classes. 

For time/frequency-continuous signal systems such as speech-recognition sys- 
tems, the fixed time-frequency resolution FFT power spectrum, and the mul- 
tiresolution discrete wavelet transform and wavelet packets are usually used for 
feature extraction. The features used are chosen from the Fourier or wavelet 
coefficients having high energy. The cepstrum and its time derivative remain the
most commonly used feature set [104]. These features are calculated by taking
the discrete cosine transform (DCT) of the logarithm of the energy at the out- 
put of a Mel filter and are commonly called Mel frequency cepstral coefficients 
(MFCC). In order to have the temporal information, the first and second time 
derivatives of the MFCC are taken. 


Criterion functions 


The MSE is by far the most popular measure of error. This error measure ensures 
that a large error receives much greater attention than a small error. The MSE 
criterion is optimal and results in an ML estimation of the weights if the dis- 
tributions of the feature vectors are Gaussian [121]. This is desired for most 
applications. In some situations, other error measures such as the mean absolute 
error, maximum absolute error, and median squared error, may be preferred. 
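
The error measures named above can be compared on a small example. The following is a minimal sketch, not taken from the text: the function name `error_measures` and the data are illustrative assumptions, chosen so that one pattern is an outlier.

```python
from statistics import median

def error_measures(targets, outputs):
    """Return MSE, mean absolute error, maximum absolute error, and median squared error."""
    errors = [t - o for t, o in zip(targets, outputs)]
    squared = [e * e for e in errors]
    absolute = [abs(e) for e in errors]
    return {
        "mse": sum(squared) / len(squared),
        "mae": sum(absolute) / len(absolute),
        "max_abs": max(absolute),
        "median_sq": median(squared),
    }

# Hypothetical data: four small errors and one outlier (last pattern).
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = [1.1, 1.9, 3.1, 3.9, 9.0]
measures = error_measures(targets, outputs)
```

On this data the MSE is dominated by the single large error, whereas the median squared error stays near the size of the typical error, which is one reason the alternative measures may be preferred when outliers are present.
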

The logarithmic error function, which takes the form of the instantaneous 
relative entropy or Kullback-Leibler divergence criterion, has some merits over 
the MSE function [16] 


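$$E_p(W) = \frac{1}{2} \sum_{i=1}^{J_M} \left[ (1 + y_{p,i}) \ln\left(\frac{1 + y_{p,i}}{1 + \hat{y}_{p,i}}\right) + (1 - y_{p,i}) \ln\left(\frac{1 - y_{p,i}}{1 - \hat{y}_{p,i}}\right) \right] \tag{2.23}$$

for the tanh activation function, where $y_{p,i} \in (-1, 1)$. For the logistic activation
function, the criterion can be written as [93]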


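$$E_p(W) = \frac{1}{2} \sum_{i=1}^{J_M} \left[ y_{p,i} \ln\left(\frac{y_{p,i}}{\hat{y}_{p,i}}\right) + (1 - y_{p,i}) \ln\left(\frac{1 - y_{p,i}}{1 - \hat{y}_{p,i}}\right) \right], \tag{2.24}$$

where $y_{p,i} \in (0, 1)$. In the latter case, $y_{p,i}$, $\hat{y}_{p,i}$, $1 - y_{p,i}$, and $1 - \hat{y}_{p,i}$ are regarded
as probabilities. These criteria take zero only when $y_{p,i} = \hat{y}_{p,i}$, $i = 1, \ldots, J_M$, and
are strictly positive otherwise. Another criterion function obtained by simplifying
(2.24) via omitting the constant terms related to the patterns is [132]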


$$E_p(W) = -\frac{1}{2} \sum_{i=1}^{J_M} \left[ y_{p,i} \ln \hat{y}_{p,i} + (1 - y_{p,i}) \ln\left(1 - \hat{y}_{p,i}\right) \right]. \tag{2.25}$$
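
The relation between criteria (2.24) and (2.25) can be checked numerically. The sketch below assumes both criteria carry the factor 1/2 and that targets lie strictly inside (0, 1); the function names are illustrative, not from the text.

```python
import math

def kl_error(targets, outputs):
    """Relative-entropy criterion (2.24) for logistic outputs in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(targets, outputs):
        total += y * math.log(y / y_hat) + (1 - y) * math.log((1 - y) / (1 - y_hat))
    return 0.5 * total

def simplified_entropy_error(targets, outputs):
    """Simplified criterion (2.25): (2.24) without the terms that depend on targets only."""
    total = 0.0
    for y, y_hat in zip(targets, outputs):
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -0.5 * total

targets = [0.2, 0.7]
offset_a = simplified_entropy_error(targets, [0.3, 0.6]) - kl_error(targets, [0.3, 0.6])
offset_b = simplified_entropy_error(targets, [0.5, 0.9]) - kl_error(targets, [0.5, 0.9])
```

The two offsets agree, so the two criteria differ only by a constant that does not depend on the network outputs and therefore share the same minimizer; only (2.24) is zero at a perfect fit.
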

The problem of loading a set of training examples onto a neural network is NP- 
complete [25, 131]. As a consequence, existing algorithms cannot be guaranteed 
to learn the optimal solution in polynomial time. In the case of one neuron, the 
logistic function paired with the MSE function can lead to $\left(\frac{N}{J_1}\right)^{J_1}$ local minima,
for $N$ training patterns and an input dimension of $J_1$ [6], while with the entropic
error function, the error function is convex and thus has only one minimum 
[16, 132]. The use of the entropic error function considerably reduces the total 
number of local minima. 

The BP algorithm derived from the entropy criteria can partially solve the flat- 
spot problem. These criteria do not add computation load to calculate the error 
function. They, however, remarkably reduce the training time, and alleviate the 
problem of getting stuck at local minima by reducing the density of local minima 
[93]. Besides, the entropy-based BP is well suited to probabilistic training data, 
since it can be viewed as learning the correct probabilities of a set of hypotheses 
represented by the outputs of the neurons. 

Traditionally, classification problems are learned through error backpropaga- 
tion by providing a vector of hard 0/1 target values to represent the class label of 
a particular pattern. Minimizing an error function with hard target values tends
to saturate the weights, leading to overfitting. The magnitude of the weights
plays a more important role in generalization than the number of hidden nodes 
[12]. Overfitting might be reduced by keeping the weights smaller. 

The cross-entropy cost function can also be derived from the ML principle 


$$E_{CE} = -\sum_{i=1}^{N} \sum_{k=1}^{C} t_{k,i} \ln(y_{k,i}) \tag{2.26}$$

for training set $\{x_i, t_i\}$, $C$ classes and $N$ samples, and $t_{k,i} \in \{0, 1\}$.
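
As an illustration of (2.26), the following sketch evaluates the cross-entropy for 1-out-of-C targets and compares it with the sum of squared errors. The function names and the two output vectors are hypothetical and chosen only for illustration.

```python
import math

def cross_entropy(targets, outputs):
    """Cross-entropy (2.26): targets are 1-out-of-C vectors, outputs are class probabilities."""
    return -sum(
        t_k * math.log(y_k)
        for t, y in zip(targets, outputs)
        for t_k, y_k in zip(t, y)
        if t_k  # terms with t = 0 contribute nothing
    )

def sum_squared_error(targets, outputs):
    """Sum of squared errors over all patterns and classes."""
    return sum((t_k - y_k) ** 2 for t, y in zip(targets, outputs) for t_k, y_k in zip(t, y))

# Hypothetical single pattern of class 1 out of C = 3.
targets = [[1, 0, 0]]
confident_wrong = [[0.01, 0.98, 0.01]]
mildly_wrong = [[0.30, 0.40, 0.30]]

ce_confident = cross_entropy(targets, confident_wrong)
ce_mild = cross_entropy(targets, mildly_wrong)
sse_confident = sum_squared_error(targets, confident_wrong)
sse_mild = sum_squared_error(targets, mildly_wrong)
```

The cross-entropy grows without bound as the probability assigned to the true class approaches zero, while the squared error stays bounded; this is the steepness that the text refers to.
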

Marked reductions in convergence time and in the density of local minima are
observed due to the characteristic steepness of the cross-entropy function
[93, 132]. As a function of the absolute errors, MSE tends to produce large
relative errors for small output values. As a function of the relative errors,
cross-entropy is expected to estimate small probabilities more accurately [62, 72, 132].
When a neural network is trained using MSE or cross-entropy minimization,
its outputs approximate the posterior probabilities of class membership. Thus,
in the presence of large data sets, it tends to produce optimal solutions in the
Bayes sense. However, minimization of these error functions does not necessarily
imply misclassification minimization in practice, since suboptimal solutions may
occur due to flat regions in weight space.

The MSE function can be obtained by the ML principle assuming the independence
and Gaussianity of the target data. However, the Gaussianity assumption
of the target data in classification is not valid, due to the discrete nature of class
labels. Thus, the MSE function is not the most appropriate one for data classification
problems. Nevertheless, when using a 1-out-of-C coding scheme for the
targets, with large N and a number of samples in each class, the MSE trained
outputs of the network approximate the posterior probabilities of the class membership
[62]. The cross-entropy error function and other entropy-based functions
are suitable for training neural network classifiers, because, when the outputs
are interpreted as probabilities, this is the optimal solution.

Classification-based (CB) error functions [110] heuristically seek to directly 
minimize classification error by backpropagating network error only on misclas- 
sified patterns. In so doing, they perform relatively minimal updates to network 
parameters in order to discourage premature weight saturation and overfitting. 
CB3 is a CB approach that learns the error function to be used while training 
[111]. This is accomplished by learning pattern confidence margins during train- 
ing, which are used to dynamically set output target values for each training 
pattern. In fact, CB3 saves time by omitting the error backpropagation step 
for correctly classified patterns with sufficient output confidences. The number 
of epochs required to converge is similar for CB3 and cross-entropy training, 
generally about half as many epochs required for MSE training. 

The MSE criterion can be generalized into the Minkowski-r metric [64] 


$$E_p = \frac{1}{r} \sum_{i=1}^{J_M} \left| \hat{y}_{p,i} - y_{p,i} \right|^r. \tag{2.27}$$


When r = 1, the metric is called the city block metric. The Minkowski-r metric 
corresponds to the MSE criterion for r = 2. A small value of r (r < 2) reduces 
the influence of large deviations, thus it can be used in the case of outliers. In 
contrast, a large r weights large deviations more heavily, and generates a better generalization
surface when the noise is absent in the data or when the data clusters in the 
training set are compact. 
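
The effect of r on outliers can be made concrete. This is a minimal sketch assuming the Minkowski-r criterion has the form (1/r) Σ |ŷ − y|^r; the helper `outlier_share` and the data are illustrative assumptions.

```python
def minkowski_r_error(targets, outputs, r):
    """Minkowski-r criterion (2.27): (1/r) * sum of |y_hat_i - y_i| ** r."""
    return sum(abs(o - t) ** r for t, o in zip(targets, outputs)) / r

# Hypothetical outputs: three small deviations and one outlier (last component).
targets = [0.0, 0.0, 0.0, 0.0]
outputs = [0.1, 0.1, 0.1, 2.0]

def outlier_share(r):
    """Fraction of the total error contributed by the outlier."""
    total = minkowski_r_error(targets, outputs, r)
    outlier = abs(outputs[-1] - targets[-1]) ** r / r
    return outlier / total

shares = {r: outlier_share(r) for r in (1, 2, 4)}
```

The share of the error attributable to the outlier grows with r, which matches the statement that a small r reduces the influence of large deviations.
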

A generalized error function embodying complementary features of other func- 
tions, which can emulate the behavior of other error functions by adjustment of 
a single real-valued parameter, is proposed in [130]. Many other criterion func- 
tions can be used for deriving learning algorithms, including those based on 
robust statistics [77] or regularization [106]. 


Computational learning theory 


Machine learning makes predictions about the unknown underlying model based 
on a training set drawn from hypotheses. Due to the finite training set, learn- 
ing theory cannot provide absolute guarantees of performance of the algorithms. 
The performance of learning algorithms is commonly bounded by probabilis- 
tic terms. Computational learning theory is a statistical tool for the analysis of 
machine learning algorithms, that is, for characterizing learning and generalization.
Computational learning theory addresses the problem of optimal generalization
capability for supervised learning. Two popular formalisms of
computational learning theory are the VC theory [144] and probably approximately
correct (PAC) learning [143]. Both approaches are nonparametric and
distribution-free learning models. 
The VC theory [144], known as the statistical learning theory, is a dependency-estimation
method with finite data. Necessary and sufficient conditions for consistency
and fast convergence are obtained based on the empirical risk minimization
(ERM) principle. Uniform convergence for a given class of approximating func- 
tions is associated with the capacity of the function class considered [144]. The 
VC dimension of a function class quantifies its classification capabilities [144]. It 
indicates the cardinality of the largest set for which all possible binary-valued 
classifications can be obtained using functions from the class. The capacity and 
complexity of the function class is measured in terms of the VC dimension. The 
ERM principle has been practically applied in the SVM [147]. The VC theory 
provides a general measure of complexity, and gives associated bounds on the 
optimism. 

PAC learning [143] aims to find a hypothesis that is a good approximation to 
an unknown target concept with a high probability. The PAC learning paradigm 
is intimately associated with the ERM principle. A hypothesis that minimizes 
the empirical error, based on a sufficiently large sample, will approximate the 
target concept with a high probability. The generalization ability of network 
training can be established by estimating the VC dimension of neural architectures. 
Boosting [123] is a PAC learning-inspired method for supervised learning. 


Vapnik-Chervonenkis dimension 


The VC dimension is a combinatorial characterization of the diversity of func- 
tions that can be computed by a given neural architecture. It can be viewed as a 
generalization of the concept of capacity first introduced by Cover [44]. The VC 
dimension can be regarded as a measure of the capacity or expressive power of 
a network. VC dimension is the measure of model complexity (capacity) used in 
VC theory. For linear estimators, the VC dimension is equivalent to the number 
of model parameters, but is hard to obtain for other types of estimators. 


Definition 2.1 (VC dimension). A subset $S$ of the domain $X$ is shattered by
a class of functions or neural network $\mathcal{N}$ if every function $f : S \to \{0, 1\}$ can be
computed on $\mathcal{N}$. The VC dimension of $\mathcal{N}$ is defined as the maximal size of a set
$S \subseteq X$ that is shattered by $\mathcal{N}$

$$\dim_{VC}(\mathcal{N}) = \max \{ |S| : S \subseteq X \text{ is shattered by } \mathcal{N} \}, \tag{2.28}$$

where $|S|$ denotes the cardinality of $S$.


For example, a neural network with the relation $f(x, w, \theta) =
\mathrm{sgn}(w^T x + \theta)$ can shatter at most three points in $R^2$, thus its VC dimension
is 3. This is shown in Fig. 2.7. The points are in general position, that is,
they are not collinear.
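
The shattering claim can be explored by brute force. The sketch below is an illustration, not a proof: it uses the perceptron rule as a separability check (convergence is guaranteed only for separable labelings, so reaching `max_epochs` is treated heuristically as "not separable"), and it tests only one particular set of four points. All names are illustrative.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Perceptron check: True if some sgn(w.x + theta) reproduces the +/-1 labels.

    Hitting max_epochs is treated as 'not separable', which is a heuristic.
    """
    w = [0.0] * len(points[0])
    theta = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + theta) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                theta += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    """True if every +/-1 labeling of the points passes the separability check."""
    return all(separable(points, labels) for labels in product((-1, 1), repeat=len(points)))

three_points = [(0, 0), (1, 0), (0, 1)]          # not on a common line
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]   # contains the exclusive-or labeling
```

Here `shattered(three_points)` holds, while `shattered(four_points)` fails on the exclusive-or labeling of the four corners.
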

A hard-limiter function with threshold $\theta$ is typically used as the activation
function for binary neurons. The basic function of the McCulloch-Pitts neuron
has a linear relation applied by a threshold operation, hence called a linear
threshold gate (LTG). A neural network with LTG has a VC dimension of
$O(N_w \log N_w)$ [14], where $N_w$ is the number of weights in a network. The VC
dimension has been generalized for neural networks with real-valued output, and
the VC dimension of various neural networks has been studied in [14].

Figure 2.7 Shatter any three points in $R^2$ into two classes.

The VC dimension can be used to estimate the number of training examples for 
a good generalization capability. The Boolean VC dimension of a neural network
$\mathcal{N}$, written $\dim_{BVC}(\mathcal{N})$, is defined as the VC dimension of the class of Boolean
functions that is computed by $\mathcal{N}$.

The VC dimension is a property of a set of functions $\{f(\alpha)\}$, and can be
defined for various classes of function $f$. The VC dimension for the set of functions
$\{f(\alpha)\}$ is defined as the maximum number of training points that can be
shattered by $\{f(\alpha)\}$. If the VC dimension is $d$, then there exists at least one set
of d points that can be shattered, but in general it will not be true that every 
set of d points can be shattered. 

Feedforward networks with threshold and logistic activation functions have
VC dimensions of $O(N_w \ln N_w)$ [17] and $O(N_w^2)$ [80], respectively. Sufficient
sample sizes are, respectively, estimated by using the PAC paradigm and the VC
dimension for feedforward networks with sigmoidal neurons [128] and feedforward
networks with LTGs [17]. These bounds on sample sizes are dependent on the
error rate of hypothesis $\varepsilon$ and the probability of failure $\delta$. A practical size of the
training set for good generalization is $N = O\left(\frac{N_w}{\varepsilon}\right)$ [70], where $\varepsilon$ specifies the
accuracy. For example, for an accuracy level of 90%, $\varepsilon = 0.1$.

It is not possible to obtain the analytic estimates of the VC dimension in 
most cases. Hence, a proposal is to measure the VC dimension of an estima- 
tor experimentally by fitting the theoretical formula to a set of experimental 
measurements of the frequency of errors on artificially generated data sets of 
varying sizes [146]. However, with this approach it may be difficult to obtain an 
accurate estimate of the VC dimension due to the variability of random samples 
in the experimental procedure. In [127], this problem is addressed by proposing
an improved design procedure for specifying the measurement points (i.e., the
sample size and the number of repeated experiments at a given sample size).
This leads to a nonuniform design structure as opposed to the uniform design 
structure used in [146]. The proposed optimized design structure leads to a more 
accurate estimation of the VC dimension using the experimental procedure. A 
more accurate estimation of VC dimension leads to improved complexity control 
using analytic VC-generalization bounds and, hence, better prediction accuracy. 


Empirical risk-minimization principle 


Assume that a set of $N$ samples, $\{(x_i, y_i)\}$, are independently drawn and identically
distributed (iid) samples from some unknown probability distribution
$p(x, y)$. Assume a machine defined by a set of possible mappings $x \to f(x, \alpha)$,
where $\alpha$ contains adjustable parameters. When $\alpha$ is selected, the machine is
called a trained machine.

The expected risk is the expectation of the generalization error for a trained 
machine, and is given by 


$$R(\alpha) = \int L(y, f(x, \alpha)) \, dp(x, y), \tag{2.29}$$

where $L(y, f(x, \alpha))$ is the loss function, measuring the discrepancy between
the output pattern $y$ and the output of the learning machine $f(x, \alpha)$. The loss
function can be defined in different forms for different purposes:

$$L(y, f(x, \alpha)) = \begin{cases} 0, & y = f(x, \alpha) \\ 1, & y \neq f(x, \alpha) \end{cases} \quad \text{(for classification)}, \tag{2.30}$$
$$L(y, f(x, \alpha)) = (y - f(x, \alpha))^2 \quad \text{(for regression)}, \tag{2.31}$$
$$L(p(x, \alpha)) = -\ln p(x, \alpha) \quad \text{(for density estimation)}. \tag{2.32}$$


The empirical risk $R_{emp}(\alpha)$ is defined to be the measured mean error on a
given training set

$$R_{emp}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i, \alpha)). \tag{2.33}$$

The ERM principle aims to approximate the function that minimizes the risk (2.29)
by minimizing the empirical risk (2.33) instead, with respect to model parameters.
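
The ERM principle can be sketched on a toy problem. In the following illustration the threshold machine, the candidate parameter values, and the samples are hypothetical; only the definition of the empirical risk as the mean loss over the training set is taken from the text.

```python
def zero_one_loss(y, prediction):
    """Classification loss: 0 if the prediction is correct, 1 otherwise."""
    return 0 if y == prediction else 1

def squared_loss(y, prediction):
    """Regression loss: squared difference."""
    return (y - prediction) ** 2

def empirical_risk(samples, f, alpha, loss):
    """Mean loss of the machine f(x, alpha) over the training samples, as in (2.33)."""
    return sum(loss(y, f(x, alpha)) for x, y in samples) / len(samples)

def threshold_machine(x, alpha):
    """Hypothetical machine: scalar input, alpha is the decision threshold."""
    return 1 if x >= alpha else 0

# Hypothetical training set; the sample (0.6, 0) cannot be fitted together with (0.5, 1).
samples = [(0.1, 0), (0.4, 0), (0.5, 1), (0.9, 1), (0.6, 0)]
risks = {
    alpha: empirical_risk(samples, threshold_machine, alpha, zero_one_loss)
    for alpha in (0.2, 0.45, 0.95)
}
best_alpha = min(risks, key=risks.get)  # ERM over the three candidate parameters
```

Among the three candidates, ERM selects the threshold 0.45, which misclassifies one of the five samples.
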
When the loss function takes the value 0 or 1, with probability $1 - \delta$, there is
the upper bound called the VC bound [147]:

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{d\left(\ln\frac{2N}{d} + 1\right) - \ln\frac{\delta}{4}}{N}}, \tag{2.34}$$

where $d$ is the VC dimension of the machine. The second term on the right-hand
side is called the VC confidence, which monotonically increases with increasing $d$.
Reducing d leads to a better upper bound on the actual error. The VC confidence 
depends on the class of functions, whereas the empirical risk and actual risk 
depend on the particular function obtained by the training procedure. 
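
The behavior of the VC confidence can be inspected numerically. The sketch below assumes the commonly stated form of the second term in (2.34), namely the square root of (d(ln(2N/d) + 1) − ln(δ/4))/N; the function names and the numerical values are illustrative.

```python
import math

def vc_confidence(d, n_samples, delta):
    """Assumed form of the VC confidence, the second term of the VC bound (2.34)."""
    return math.sqrt((d * (math.log(2 * n_samples / d) + 1) - math.log(delta / 4)) / n_samples)

def vc_bound(empirical_risk, d, n_samples, delta):
    """Upper bound on the actual risk: empirical risk plus VC confidence."""
    return empirical_risk + vc_confidence(d, n_samples, delta)

# Illustrative values: N = 1000 samples, confidence level 1 - delta = 0.95.
confidences = [vc_confidence(d, 1000, 0.05) for d in (5, 20, 80)]
```

For a fixed sample size the confidence term grows with d, and for a fixed d it shrinks as more samples become available.
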
For regression problems, a practical form of the VC bound is used [148]:

$$R(d) \le R_{emp}(d) \left( 1 - \sqrt{p - p \ln p + \frac{\ln N}{2N}} \right)^{-1}, \tag{2.35}$$

where $p = \frac{d}{N}$ and $d$ is the VC dimension. The VC bound (2.35) is a special case
of the general analytical bound [147] with appropriately chosen practical values
of theoretical constants.

The principle of structural risk minimization (SRM) minimizes the risk func- 
tional with respect to both the empirical risk and the VC dimension of the set of 
functions; thus, it aims to find the subset of functions that minimizes the bound 
on the actual risk. The SRM principle is crucial to obtain good generalization 
performances for a variety of learning machines, including SVMs. It finds the 
function that achieves the minimum of the guaranteed risk for the fixed amount 
of data. To find the guaranteed risk, one has to use bounds, e.g., VC bound, on 
the actual risk. Empirical comparisons between AIC, BIC and the SRM method 
are presented for regression problems in [39], based on VC theory. VC-based 
model selection consistently outperforms AIC for all the datasets, whereas the 
SRM and BIC methods show similar predictive performance. 


Function approximation, regularization, risk minimization 
Classical statistics and function approximation/regularization rely on the true 
model that underlies generated data. In contrast, VC learning theory is based on 
the concept of risk minimization, and does not use the notion of a true model. 
The distinction between the three learning paradigms becomes blurred when 
they are used to motivate practical learning algorithms. Least-squares (LS) min- 
imization for function estimation can be derived using the parametric estimation 
approach via ML arguments under Gaussian noise assumptions, and it can alter- 
natively be introduced under the risk minimization approach. SVM methodology 
was originally developed in VC-theory, and later re-introduced in the function 
approximation/regularization setting [68]. An important conceptual contribu- 
tion of the VC approach states that generalization (learning) with finite samples 
may be possible even if accurate function approximation is not [41]. The regular- 
ization program does not yield good generalization for finite sample estimation 
problems. 

In the function approximation theory, the goal is to estimate an unknown true 
target function in regression problems, or posterior probability $P(y|x)$ in classification
problems. In VC theory, it is to find the target function that minimizes
prediction risk or achieves good generalization. That is, the result of VC learning 
depends on (unknown) input distribution, while that of function approximation 
does not. The important concept of margin was originally introduced under the 
VC approach, and later explained and interpreted as a form of regularization. 
However, the notion of margin is specific to SVM, and it does not exist under the 
regularization framework. Any of the methodologies (including SRM and SVM) 
can be regarded as a special case of regularization. 


Probably approximately correct (PAC) learning 


The PAC learning paradigm is concerned with learning from examples of a tar- 
get function called concept, by choosing from a set of functions known as the 
hypothesis space, a function that is meant to be a good approximation to the 
target. 

Let $C_n$ and $H_n$, $n \ge 1$, respectively, be a set of target concepts and a set of
hypotheses over the instance space $\{0, 1\}^n$, where $C_n \subseteq H_n$ for $n \ge 1$. When there
exists a polynomial-time learning algorithm that achieves low error with high
confidence in approximating all concepts in a class $C = \{C_n\}$ by the hypothesis
space $H = \{H_n\}$ if enough training data is available, the class of concepts $C$ is
said to be PAC learnable by $H$ or simply PAC learnable. Uniform convergence
of the empirical error of a function towards the real error on all possible inputs 
guarantees that all training algorithms that yield a small training error are PAC. 
A function class is PAC learnable if and only if the capacity in terms of the VC 
dimension is finite. 

In this framework, we are given a set of inputs and a hypothesis space of 
functions that maps the inputs onto {0,1}. Assume that there is an unknown 
but usually fixed probability distribution on the inputs, and the aim is to find 
a good approximation to a particular target concept from the hypothesis space, 
given only a random sample of the training examples and the value of the target 
concept on these examples. 

The sample complexity of a learning algorithm, $N_{PAC}$, is defined as the smallest
number of samples required for learning $C$ by $H$, that achieve a given approximation
accuracy $\varepsilon$ with a probability $1 - \delta$. Any consistent algorithm that learns
$C$ by $H$ has a sample complexity with the upper bound [5, 69]

$$N_{PAC} \le \frac{1}{\varepsilon (1 - \sqrt{\varepsilon})} \left( 2 \dim_{VC}(H_n) \ln\frac{6}{\varepsilon} + \ln\frac{2}{\delta} \right), \quad \forall\, 0 < \delta < 1. \tag{2.36}$$

In other words, with probability of at least $1 - \delta$, the algorithm returns a hypothesis
$h \in H_n$ with an error less than $\varepsilon$.

In terms of the cardinality of $H_n$, denoted $|H_n|$, it can be shown [145, 69] that
the sample complexity is upper bounded by

$$N_{PAC} \le \frac{1}{\varepsilon} \left( \ln |H_n| + \ln\frac{1}{\delta} \right). \tag{2.37}$$


For most hypothesis spaces on Boolean domains, the second bound gives a bet- 
ter bound. On the other hand, most hypothesis spaces on real-valued attributes 
are infinite, so only the first bound is applicable. PAC learning is particularly useful
for obtaining upper bounds on sufficient training sample size. Linear threshold
concepts (perceptrons) are PAC learnable on both Boolean and real-valued
instance spaces [69]. 
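
The two sample-complexity bounds can be compared for a finite hypothesis space. The sketch below assumes that (2.36) has the form (2 d ln(6/ε) + ln(2/δ)) / (ε(1 − √ε)) with d the VC dimension, and that (2.37) has the form (ln|H| + ln(1/δ)) / ε. The hypothesis space size 3^10 and VC dimension 10 are illustrative values, not taken from the text.

```python
import math

def pac_bound_vc(vc_dim, epsilon, delta):
    """Assumed form of bound (2.36), in terms of the VC dimension."""
    numerator = 2 * vc_dim * math.log(6 / epsilon) + math.log(2 / delta)
    return numerator / (epsilon * (1 - math.sqrt(epsilon)))

def pac_bound_finite(hypothesis_count, epsilon, delta):
    """Assumed form of bound (2.37), in terms of the cardinality of the hypothesis space."""
    return (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon

# Illustrative Boolean hypothesis space: |H| = 3**10, VC dimension 10.
epsilon, delta = 0.1, 0.05
n_vc = pac_bound_vc(10, epsilon, delta)
n_finite = pac_bound_finite(3 ** 10, epsilon, delta)
```

For these values the cardinality-based bound is roughly an order of magnitude smaller than the VC-based bound, in line with the remark that the second bound is usually better on Boolean domains.
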


No-free-lunch theorem 


Before the no-free-lunch theorem [151] was proposed, people intuitively believed
that there exist some universally beneficial algorithms for search, and many people
actually made efforts to design such algorithms. The no-free-lunch theorem
asserts that there is no universally beneficial algorithm.

The no-free-lunch theorem states that no search algorithm is better than 
another in locating an extremum of a cost function when averaged over the 
set of all possible discrete functions. That is, all search algorithms achieve the 
same performance as random enumeration, when evaluated over the set of all 
functions. 
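
The statement can be verified exhaustively on a tiny search space. The following sketch enumerates all 3^4 = 81 functions from a four-point domain to three cost values and compares two deterministic, non-revisiting search strategies; the strategies and the performance measure (best value found after k evaluations) are illustrative choices.

```python
from itertools import product

X = range(4)  # search points
Y = range(3)  # possible function values

def ascending(history):
    """Visit the points in the fixed order 0, 1, 2, 3."""
    return len(history)

def adaptive(history):
    """Pick the next unvisited point depending on the last observed value."""
    visited = {x for x, _ in history}
    unvisited = [x for x in X if x not in visited]
    if not history:
        return unvisited[-1]
    last_value = history[-1][1]
    return unvisited[last_value % len(unvisited)]

def best_after(algorithm, f, k):
    """Best (largest) value found by the algorithm after k evaluations of f."""
    history = []
    for _ in range(k):
        x = algorithm(history)
        history.append((x, f[x]))
    return max(value for _, value in history)

def average_performance(algorithm, k):
    """Average of best_after over all functions f: X -> Y."""
    functions = list(product(Y, repeat=len(X)))
    return sum(best_after(algorithm, f, k) for f in functions) / len(functions)
```

For every number of evaluations k, `average_performance(ascending, k)` equals `average_performance(adaptive, k)`: adapting to the observed values brings no gain once the average is taken over all functions.
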


Theorem 2.1 (No-free-lunch theorem). Given the set of all functions $\mathcal{F}$
and a set of benchmark functions $\mathcal{F}_1$, if algorithm $A_1$ is better on average than
algorithm $A_2$ on $\mathcal{F}_1$, then algorithm $A_2$ must be better than algorithm $A_1$ on
$\mathcal{F} - \mathcal{F}_1$.


The performance of any algorithm is determined by the knowledge concerning 
the cost function. Thus, it is meaningless to evaluate the performance of an algo- 
rithm without specifying the prior knowledge. Practical problems always contain 
priors such as smoothness, symmetry, and i.i.d. samples. For example, although 
neural networks are usually deemed a powerful approach for classification, they 
cannot solve all classification problems. For some arbitrary classification problems,
other methods may be more efficient.

The no-free-lunch theorem was later extended to coding methods, early stop- 
ping [31], avoidance of overfitting, and noise prediction [91]. Again, it has been 
asserted that no one method is better than the others for all problems. 

Following the no-free-lunch theorem, the inefficiency of leave-one-out cross- 
validation was demonstrated on a simple problem in [159]. In response to [159], 
in [63] the strict leave-one-out crossvalidation was shown to yield the expected 
results on this simple problem, thus leave-one-out crossvalidation is not subject 
to the no-free-lunch criticism [63]. Nonetheless, it is concluded in [115] that the 
statistical tests are preferable to crossvalidation for linear as well as for nonlinear 
model selection. 


Neural networks as universal machines 


The power of neural networks stems from their representation capability. On the 
one hand, feedforward networks are proved to offer the capability of universal 
function approximation. On the other hand, recurrent networks using the sigmoidal
activation function are Turing equivalent [129] and can simulate a universal
Turing machine. Thus, recurrent networks can compute whatever function any
digital computer can compute.


Boolean function approximation 


Feedforward networks with binary neurons can be used to represent logic or 
Boolean functions. In binary neural networks, the input and output values for 
each neuron are Boolean variables, denoted by binary (0 or 1) or bipolar (−1 or
+1) representation. For $J_1$ independent Boolean variables, there are $2^{J_1}$ combinations
of these variables. This leads to a total of $2^{2^{J_1}}$ different Boolean functions
of $J_1$ variables. An LTG can discriminate between two classes.

The function counting theorem [44], [65] gives the number of linearly separable 
dichotomies of $m$ points in general position in $R^n$. It essentially estimates the
separating capability of an LTG. 


Theorem 2.2 (Function counting theorem). The number of linearly separable
dichotomies of $m$ points in general position in $R^n$ is

$$C(m, n) = \begin{cases} 2 \sum_{i=0}^{n} \binom{m-1}{i}, & m > n + 1 \\ 2^m, & m \le n + 1 \end{cases}. \tag{2.38}$$


A set of $m$ points in $R^n$ is said to be in general position if every subset of $n$
or fewer points is linearly independent.

The total number of possible dichotomies of $m$ points is $2^m$. Under the assumption
of $2^m$ equiprobable dichotomies, the probability of a single LTG with $n$
inputs to separate $m$ points in general position is given by

$$P(m, n) = \frac{C(m, n)}{2^m} = \begin{cases} 2^{1-m} \sum_{i=0}^{n} \binom{m-1}{i}, & m > n + 1 \\ 1, & m \le n + 1 \end{cases}. \tag{2.39}$$

The fraction $P(m, n)$ is the probability of linear dichotomy. Thus, if $\frac{m}{n+1} \le 1$,
$P = 1$; if $1 < \frac{m}{n+1} < 2$ and $n \to \infty$, $P \to 1$. At $\frac{m}{n+1} = 2$, $P = \frac{1}{2}$. Usually, $m =
2(n + 1)$ is used to characterize the statistical capability of a single LTG. Equation
(2.39) is plotted in Fig. 2.8.
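
The function counting theorem is easy to evaluate directly. The sketch below assumes that the sum in (2.38) runs from i = 0 to n, which is consistent with the stated value P = 1/2 at m = 2(n + 1); the function names are illustrative.

```python
from math import comb

def count_dichotomies(m, n):
    """C(m, n): linearly separable dichotomies of m points in general position in R^n."""
    if m <= n + 1:
        return 2 ** m
    return 2 * sum(comb(m - 1, i) for i in range(n + 1))

def prob_linear_dichotomy(m, n):
    """P(m, n) = C(m, n) / 2**m, the probability of linear dichotomy."""
    return count_dichotomies(m, n) / 2 ** m

# Four points in the plane: 14 of the 16 dichotomies are linearly separable.
c_4_2 = count_dichotomies(4, 2)
p_at_capacity = [prob_linear_dichotomy(2 * (n + 1), n) for n in (1, 5, 20)]
```

For four points in the plane the count is 14, and the probability equals one half at m = 2(n + 1) for each tested n.
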

A three-layer ($J_1$-$2^{J_1}$-1) feedforward LTG network can represent any Boolean
function with $J_1$ arguments [45, 94]. To realize an arbitrary function $f : R^{J_1} \to
\{0, 1\}$ defined on $N$ arbitrary points in $R^{J_1}$, the lower bound for the number of
hidden nodes is derived as $O\left(\frac{N}{J_1 \log_2 \frac{N}{J_1}}\right)$ for $N \ge 3J_1$ and $J_1 \to \infty$ [65, 16]; for
$N$ points in general position, the lower bound is $\frac{N}{2J_1}$ when $J_1 \to \infty$ [65]. Networks
with two or more hidden layers are found to be potentially more size efficient 
than networks with a single hidden layer [65]. 
Figure 2.8 The probability of linear dichotomy of m points in n dimensions. 


Binary radial basis function 
For three-layer feedforward networks, if the activation function of the hidden 
neurons is selected as the binary RBF or generalized binary RBF and the output 
neurons are selected as LTGs, one obtains binary or generalized binary RBF 
networks. Binary or generalized binary RBF network can be used for the mapping 
of Boolean functions. 

The parameters of the generalized binary RBF neuron are the center $c \in R^n$
and the radius $r > 0$. The activation function $\phi : R^n \to \{0, 1\}$ is defined by

$$\phi(x) = \begin{cases} 1, & \|x - c\|_A \le r \\ 0, & \text{otherwise} \end{cases}, \tag{2.40}$$

where $A$ is any real, symmetric and positive-definite matrix, and $\| \cdot \|_A$ is the
weighted Euclidean norm. When $A$ is the identity matrix $I$, the neuron becomes
a binary RBF neuron.
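
A generalized binary RBF neuron can be sketched in a few lines. The code below assumes that the weighted Euclidean norm is the square root of (x − c)^T A (x − c) and that the neuron fires when this distance is at most r; the function name and the example parameters are illustrative.

```python
import math

def generalized_binary_rbf(x, center, radius, A=None):
    """Return 1 if the weighted distance between x and the center is at most the radius, else 0.

    A is a symmetric positive-definite matrix given as nested lists;
    None stands for the identity matrix (plain binary RBF neuron).
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    if A is None:
        squared = sum(d * d for d in diff)
    else:
        size = len(diff)
        squared = sum(diff[i] * A[i][j] * diff[j] for i in range(size) for j in range(size))
    return 1 if math.sqrt(squared) <= radius else 0

corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [generalized_binary_rbf(x, (1, 1), 0.5) for x in corners]
or_outputs = [generalized_binary_rbf(x, (1, 1), 1.0) for x in corners]
```

With center (1, 1), a radius of 0.5 realizes the Boolean AND and a radius of 1 realizes the Boolean OR, two functions that an LTG also computes.
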

Every Boolean function computed by the LTG can also be computed by any 
generalized binary RBF neuron, and generalized binary RBF neurons are more 
powerful than LTGs [58]. As an immediate consequence, in any neural network, 
any LTG that receives only binary inputs can be replaced by a generalized binary 
RBF neuron having any norm, without any loss of the computational power of 
the neural network. 

Given a $J_1$-$J_2$-1 feedforward network, whose output neuron is an LTG, we
denote the network as $\mathcal{N}_1$, $\mathcal{N}_2$, and $\mathcal{N}_3$, when the $J_2$ hidden neurons are respectively
selected as LTGs, binary RBF neurons, and generalized binary RBF neurons.
The VC dimensions of the three networks have the relation [58]

$$\dim_{BVC}(\mathcal{N}_1) = \dim_{BVC}(\mathcal{N}_2) \le \dim_{BVC}(\mathcal{N}_3). \tag{2.41}$$
When $J_1 \ge 3$ and $J_2 \le \frac{2^{J_1+1}}{J_1^2 + J_1 + 2}$, the lower bound for the three neural networks
is given as [13, 58]

$$\dim_{BVC}(\mathcal{N}_1) \ge J_1 J_2 + 1. \tag{2.42}$$


Linear separability and nonlinear separability 


Definition 2.2 (Linearly separable). Assume that there is a set $X$ of $N$
patterns $x_i$ of $J_1$ dimensions, each belonging to one of two classes $C_1$ and $C_2$. If
there is a hyperplane that separates all the samples of $C_1$ from $C_2$, then such a
classification problem is said to be linearly separable.


A single LTG can realize a linearly separable dichotomy function, characterized
by a linear separating surface (hyperplane)

$$w^T x + w_0 = 0, \tag{2.43}$$

where $w$ is a $J_1$-dimensional vector and $w_0$ is a bias toward the origin. For a
pattern, if $w^T x + w_0 > 0$, it belongs to $C_1$; if $w^T x + w_0 < 0$, it belongs to $C_2$.


Definition 2.3 ($\varphi$-separable). A dichotomy $\{C_1, C_2\}$ of set $X$ is said to be
$\varphi$-separable if there exists a mapping $\varphi : R^{J_1} \to R^{J_2}$ that satisfies a separating
surface [44]

$$w^T \varphi(x) = 0 \tag{2.44}$$

such that $w^T \varphi(x) > 0$ if $x \in C_1$ and $w^T \varphi(x) < 0$ if $x \in C_2$. Here $w$ is a $J_2$-dimensional
vector.


A linearly inseparable dichotomy can become nonlinearly separable. As shown
in Fig. 2.9, the two linearly inseparable dichotomies become $\varphi$-separable.

The nonlinearly separable problem can be realized by using a polynomial
threshold gate, which changes the linear term in the LTG into high-order polynomials.
The function counting theorem is applicable to polynomial threshold
gates; and it still holds true if the set of $m$ points is in general position in $\varphi$-space,
that is, the set of $m$ points is in $\varphi$-general position.
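
A polynomial threshold gate for the exclusive-or problem can be sketched as follows. The feature map `phi` (constant, linear terms, and the product of the two inputs) and the weight vector are illustrative choices, not taken from the text.

```python
def phi(x):
    """Hypothetical feature map R^2 -> R^4: constant, x1, x2, and the product x1 * x2."""
    x1, x2 = x
    return (1.0, x1, x2, x1 * x2)

def polynomial_threshold_gate(x, w):
    """Return 1 if the weighted sum of the features of x is positive, and 0 otherwise."""
    return 1 if sum(wi * pi for wi, pi in zip(w, phi(x))) > 0 else 0

# Weights of the separating surface x1 + x2 - 2 x1 x2 - 0.5 = 0 in the feature space.
w_xor = (-0.5, 1.0, 1.0, -2.0)
xor_outputs = {x: polynomial_threshold_gate(x, w_xor) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The exclusive-or dichotomy, which no LTG on the raw inputs can realize, becomes separable by a hyperplane once the second-order term is added to the feature vector.
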


Example 2.5: Some examples of linearly separable classes and linearly insepara- 
ble classes in two-dimensional space are illustrated in Fig. 2.9. (a) Two linearly 
separable classes with $x_1 - x_2 = 0$ as the delimiter. (b) and (c) are linearly inseparable
classes, where (b) is the exclusive-or problem. Note that the linearly inseparable
classification problems in cases (b) and (c) become nonlinearly separable,
when the separating surfaces are ellipses. 
Figure 2.9 Linearly separable, linearly inseparable, and nonlinearly separable classification in 
two-dimensional space. Dots and circles denote patterns of different classes. 




Higher-order neurons (or Σ-Π units) are simple but powerful extensions of linear
neuron models. They introduce the concept of nonlinearity by incorporating
monomials, that is, products of input variables, as a hidden layer. Higher-order 
neurons with k monomials in n variables are shown to have VC dimension at 
least nk + 1 [124]. 


Continuous function approximation 


A three-layer feedforward network with a sufficient number of hidden units can 
approximate any continuous function to any degree of accuracy. This is guaran- 
teed by Kolmogorov’s theorem [81, 71]. 


Theorem 2.3 (Kolmogorov). Any continuous real-valued function
$f(x_1, \ldots, x_n)$ defined on $[0, 1]^n$, $n \ge 2$, can be represented in the form

$$f(x_1, \ldots, x_n) = \sum_{j=1}^{2n+1} h_j \left( \sum_{i=1}^{n} \psi_{ij}(x_i) \right), \tag{2.45}$$

where $h_j$ and $\psi_{ij}$ are continuous functions of one variable, and $\psi_{ij}$ are monotonically
increasing functions independent of $f$.


Kolmogorov’s theorem is the famous solution to Hilbert’s 13th problem. 
According to Kolmogorov’s theorem, a continuous multivariate function on a 
compact set can be expressed using superpositions and compositions of a finite 
number of single-variable functions. Based on Kolmogorov’s theorem, Hecht- 
Nielsen provided a theorem that is directly related to neural networks [71]. 


Theorem 2.4 (Hecht-Nielsen). Any continuous real-valued mapping
$f: [0,1]^n \to R^m$ can be approximated to any degree of accuracy by a
feedforward network with $n$ input nodes, $2n + 1$ hidden units, and $m$
output units.
The Weierstrass theorem asserts that any continuous real-valued multivariate
function on a compact set can be approximated to any accuracy using a
polynomial. The Stone-Weierstrass theorem [119] is a generalization of the
Weierstrass theorem, and is usually used for verifying a model’s
approximation capability to dynamic systems.
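
The Weierstrass statement can be checked numerically in one dimension. The sketch below fits least-squares polynomials of increasing degree to a smooth function on $[-1, 1]$ and records the maximum error on a grid. The test function and the chosen degrees are arbitrary assumptions for illustration, and a least-squares fit is only a convenient stand-in for the best uniform approximation the theorem refers to.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def target(x):
    # Hypothetical smooth test function on the compact interval [-1, 1].
    return np.exp(x) * np.sin(3.0 * x)

x = np.linspace(-1.0, 1.0, 401)
max_errors = {}
for degree in (2, 4, 8, 12):
    # Least-squares polynomial fit expressed in the Chebyshev basis.
    poly = Chebyshev.fit(x, target(x), degree)
    max_errors[degree] = float(np.max(np.abs(poly(x) - target(x))))

for degree, err in max_errors.items():
    print(f"degree {degree:2d}: max |error| on the grid = {err:.2e}")
```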


Theorem 2.5 (Stone-Weierstrass). Let $F$ be a set of real continuous
functions on a compact domain $U$ of $n$ dimensions. Let $F$ satisfy the
following criteria:


1. Algebraic closure: $F$ is closed under addition, multiplication, and scalar
multiplication. That is, for any two $f_1, f_2 \in F$, we have
$f_1 f_2 \in F$ and $a_1 f_1 + a_2 f_2 \in F$, where $a_1$ and $a_2$ are any
real numbers.

2. Separability on $U$: for any two different points $x_1, x_2 \in U$,
$x_1 \ne x_2$, there exists $f \in F$ such that $f(x_1) \ne f(x_2)$.

3. Not constantly zero on $U$: for each $x \in U$, there exists $f \in F$ such
that $f(x) \ne 0$.


Then $F$ is a dense subset of $C(U)$, the set of all continuous real-valued
functions on $U$. In other words, for any $\epsilon > 0$ and any function
$g \in C(U)$, there exists $f \in F$ such that $|g(x) - f(x)| < \epsilon$ for
any $x \in U$.


To date, numerous attempts have been made in searching for suitable forms of 
activation functions and proving the corresponding network’s universal approx- 
imation capabilities. Universal approximation to a given nonlinear functional 
under certain conditions can be realized by using the classical Volterra series or 
the Wiener series. 


Winner-takes-all 


The winner-takes-all (WTA) competition is widely observed in both inanimate 
and biological media and society. Theoretical analysis [89] shows that WTA is 
a powerful computational module in comparison with threshold gates and sig- 
moidal gates (i.e., McCulloch-Pitts neurons). An optimal quadratic lower bound 
is given in [89] for computing WTA in any feedforward circuit consisting of 
threshold gates. Arbitrary continuous functions can be approximated by circuits 
employing a single soft WTA gate as their only nonlinear operation [89]. 


Theorem 2.6 (Maass, 1 [89]). Assume that WTA with $n \ge 3$ inputs is
computed by some arbitrary feedforward circuit $C$ consisting of threshold
gates with arbitrary weights. Then $C$ consists of at least
$\binom{n}{2} + n$ threshold gates.


Theorem 2.7 (Maass, 2 [89]). Any two-layer feedforward circuit $C$ (with $m$
analog or binary input variables and one binary output variable) consisting of
threshold gates can be simulated by a circuit consisting of a single
k-winner-take-all gate (k-WTA) applied to $n$ weighted sums of the input
variables with positive weights, except for some set $S \subset R^m$ of inputs
that has measure 0.

Any boolean function $f: \{0,1\}^m \to \{0,1\}$ can be computed by a single
k-WTA gate applied to weighted sums of the input bits. If $C$ has polynomial
size and integer weights and the size is bounded by a polynomial in $m$, then
$n$ can be bounded by a polynomial in $m$, and all weights in the simulating
circuit are natural numbers and the circuit size is bounded by a polynomial
in $m$.


For real-valued input $(x_1, \ldots, x_n)$, a soft-WTA has as its output
$(r_1, \ldots, r_n)$, analog numbers $r_i$ reflecting the relative position of
$x_i$ within the ordering of the inputs. Soft-WTA is plausible as a
computational function of cortical circuits with lateral inhibition. Single
gates from a fairly large class of soft-WTA gates can serve as the only
nonlinearity in universal approximators for arbitrary continuous functions.
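
To make the gate itself concrete, here is a minimal sketch of a hard k-WTA gate under the usual reading that output $i$ is 1 exactly when $x_i$ is among the $k$ largest inputs. This is an assumed definition for illustration (ties are not handled), and the soft-WTA gates analyzed in [89] are not reproduced here.

```python
import numpy as np

def k_wta(x, k):
    """Hard k-winner-take-all: 1 for the k largest inputs, 0 elsewhere.

    Assumes the inputs contain no ties.
    """
    x = np.asarray(x, dtype=float)
    winners = np.argsort(x)[-k:]          # indices of the k largest values
    out = np.zeros(x.shape, dtype=int)
    out[winners] = 1
    return out

inputs = [0.2, 1.5, -0.3, 0.9]
print(k_wta(inputs, 1))   # plain WTA is the special case k = 1
print(k_wta(inputs, 2))
```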


Theorem 2.8 (Maass, 3 [89]). Assume that $h: D \to [0,1]$ is an arbitrary
continuous function with a bounded and closed domain $D \subset R^m$. Then for
any $\epsilon > 0$ and for any function $g$ satisfying the above conditions
there exist natural numbers $k$, $n$, biases $\alpha_0^j \in R$, and
coefficients $\alpha_i^j \ge 0$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$,
so that the circuit consisting of the soft-WTA gate soft-WTA$^g_{n,k}$ applied
to the $n$ sums $\sum_{i=1}^{m} \alpha_i^j z_i + \alpha_0^j$ for
$j = 1, \ldots, n$ computes a function $f: D \to [0,1]$ so that
$|f(z) - h(z)| < \epsilon$ for all $z \in D$. Thus, circuits consisting of a
single soft-WTA gate applied to positive weighted sums of the input variables
are universal approximators for continuous functions.


Compressed sensing and sparse approximation


A rationale behind sparse coding is the sparse connectivity between neurons in
the human brain. In the sparse coding model for the primary visual cortex, a small
subset of learned dictionary elements will encode most natural images, and only 
a small subset of the cortical neurons need to be active for representing the 
high-dimensional visual inputs [100]. In a sparse representation a small number 
of coefficients contain a large portion of the energy. Sparse representations of 
signals are of fundamental importance in fields such as blind source separation, 
compression, sampling and signal analysis. 

Compressed sensing, or compressed sampling, integrates the signal acquisi- 
tion and compression steps into a single process. It is an alternative to Shan- 
non/Nyquist sampling for the acquisition of sparse or compressible signals that 
can be well approximated by just K < N elements from an N-dimensional basis 
[48], [28]. Compressed sensing allows perfect recovery of sparse signals (or sig- 
nals sparse in some basis) using only a small number of random measurements. 
In practice, signals tend to be compressible, rather than sparse. Compressible 
signals are well approximated by sparse signals. 
Modeling a target signal as a sparse linear combination of atoms (elementary
signals) drawn from a dictionary (a fixed collection), known as sparse coding, has 
become a popular paradigm in many fields, including signal processing, statistics, 
and machine learning. Many signals like audio, images and video can be efficiently 
represented by sparse coding. Sparse coding is also a type of matrix factorization 
technique. The goal of sparse coding is to learn an over-complete basis set that 
represents each data point as a sparse combination of the basis vectors. 


Compressed sensing 


Compressed sensing relies on two fundamental properties: compressibility of
the data and the acquisition of incoherent measurements. A signal $x$ is said
to be compressible if there exists a dictionary $\Phi$ such that the
coefficients $\alpha = \Phi^T x$ are sparsely distributed. In compressed
sensing, a signal $x \in C^N$ is acquired by collecting data of linear
measurements

$$y = Ax + n, \qquad (2.46)$$

where the random matrix $A$ is an $M \times N$ “sampling” or measurement
matrix, with the number of measurements $M$ in $y \in C^M$ smaller than the
number of samples $N$ in $x$: $M < N$, and $n$ is a noise term. It focuses on
underdetermined problems where the forward operator $A \in C^{M \times N}$ has
unit-norm columns and forms an incomplete basis with $M \ll N$.

In compressed sensing, random distributions for generating A have to satisfy 
the so-called restricted isometry property (RIP) in order to preserve the infor- 
mation in sparse and compressible signals, and ensure a stable recovery of both 
sparse and compressible signals x [28]. A large class of random matrices have 
the RIP with high probability. 


Definition 2.4 (K-restricted isometry property (K-RIP), [28]). The
$M \times N$ matrix $A$ is said to satisfy the K-restricted isometry property
(K-RIP) if there exists a constant $\delta \in (0,1)$ such that

$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2 \qquad (2.47)$$

holds for any K-sparse vector $x$ of length $N$. A vector $x$ is said to be
K-sparse when $\|x\|_0 \le K$, where $\|\cdot\|_0: R^N \to R$ is the
$L_0$-norm, which returns the number of nonzero elements in its argument,
i.e., when $x$ has at most $K$ nonzero entries. The minimum of all constants
$\delta \in (0,1)$ that satisfy (2.47) is referred to as the restricted
isometry constant $\delta_K$.


In other words, K-RIP ensures that all submatrices of A of size M x K are 
close to an isometry, and therefore distance (and information) preserving. The 
goal is to push M as close as possible to K in order to perform as much signal 
compression during acquisition as possible. RIP measures the orthogonality of 
column vectors of a dictionary. 
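
The near-isometry in (2.47) can be probed empirically. The sketch below draws one Gaussian matrix, applies it to many random K-sparse vectors, and records the ratio $\|Ax\|_2^2 / \|x\|_2^2$. The sizes, the Gaussian ensemble, and the $1/\sqrt{M}$ scaling are assumptions made for this example. Because the true $\delta_K$ is a maximum over all K-sparse vectors, random sampling can only give an optimistic lower estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 64, 256, 5              # hypothetical problem sizes

# Gaussian measurement matrix, scaled so that E||Ax||^2 = ||x||^2.
A = rng.standard_normal((M, N)) / np.sqrt(M)

ratios = []
for _ in range(2000):
    x = np.zeros(N)
    support = rng.choice(N, size=K, replace=False)   # random K-sparse support
    x[support] = rng.standard_normal(K)
    ratios.append(np.sum((A @ x) ** 2) / np.sum(x ** 2))

ratios = np.array(ratios)
# Sampled (optimistic) estimate of the restricted isometry constant.
delta_estimate = max(1.0 - ratios.min(), ratios.max() - 1.0)
print(ratios.min(), ratios.max(), delta_estimate)
```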




2.11.2 


Chapter 2. Fundamentals of Machine Learning 


Compressed sensing can achieve stable recovery of compressible, noisy signals
through the solution of the computationally tractable $L_1$-regularized
inverse problem

$$\min \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2^2 \le \epsilon^2. \qquad (2.48)$$

Linear programming (LP) is the reconstruction method that achieves the best
sparsity-undersampling tradeoff, but it has a high computational cost for
large-scale applications. LASSO [139] and approximate message-passing [50] are
well-known low-complexity reconstruction procedures.

The popular least absolute shrinkage and selection operator (LASSO) minimizes
a weighted sum of the residual norm and a regularization term $\|x\|_1$.
LASSO has the ability to reconstruct sparse solutions when sampling occurs far
below the Nyquist rate, and also to recover the sparsity pattern exactly with
probability one, asymptotically as the number of observations increases. The
approximate message-passing algorithm [50] is an iterative-thresholding
algorithm, leading to the sparsity-undersampling tradeoff equivalent to that
of the corresponding LP procedure while running dramatically faster.

Standard compressed sensing dictates that robust signal recovery is possible 
from O(K log(N/K)) measurements. A model-based compressed sensing theory 
[9] provides concrete guidelines on how to create model-based recovery algo- 
rithms with provable performance guarantees. Wavelet trees and block sparsity 
are integrated into two compressed sensing recovery algorithms (Matlab toolbox, 
http://dsp.rice.edu/software) and they are proved to offer robust recovery 
from just O(K) measurements [9]. 


Sparse approximation 


With a formulation similar to that of compressed sensing, sparse approximation
has a different objective. Assume that a target signal $y \in R^M$ can be
represented exactly (or at least approximated with sufficient accuracy) by a
linear combination of exemplars in the overcomplete dictionary
$A = (a_1, a_2, \ldots, a_N)$:

$$y = Ax, \qquad (2.49)$$

where $A$ is a real $M \times N$ matrix with $N > M$ whose columns have unit
Euclidean norm: $\|a_j\|_2 = 1$, $j = 1, 2, \ldots, N$, and $x \in R^N$. In
fact, since the dictionary is overcomplete, any vector can be represented as a
linear combination of vectors from the dictionary.

Sparsity has emerged as a fundamental type of regularization. Sparse
approximation [36] seeks an approximate solution to (2.49) while requiring
that the number $K$ of nonzero entries of $x$ is small relative to its
dimension $N$. Compressive sensing is a specific type of sparse approximation
problem.
Although the system of linear equations in (2.49) has no unique solution, if
$x$ is sufficiently sparse, $x$ can be uniquely determined by solving [48]

$$\hat{x} = \arg\min_{\tilde{x} \in R^N} \|\tilde{x}\|_0 \quad \text{subject to} \quad y = A\tilde{x}. \qquad (2.50)$$

The combinatorial problem (2.50) is NP-hard [97].
With weak conditions on $A$, the solution of the $L_0$-norm minimization given
by (2.50) is equal to the solution of an $L_1$-norm minimization [49]

$$\hat{x} = \arg\min_{\tilde{x} \in R^N} \|\tilde{x}\|_1 \quad \text{subject to} \quad y = A\tilde{x}. \qquad (2.51)$$

This convex minimization problem is the same as that given by (2.48). This
indicates that the problems of recovering sparse signals from compressed
measurements and constructing sparse approximation are the same in nature.

$L_1$ regularization gives rise to convex optimization and is thus widely used
for generating results with sparse loadings, whereas $L_0$ regularization does
not and furthermore yields NP-hard problems. However, sparsity is better
achieved with $L_0$ penalties based on prediction and false discovery rate
arguments [86]. The relation between $L_1$ and $L_0$ has been studied in the
compressed sensing literature [48]. Under the RIP condition the $L_1$ and
$L_0$ solutions are equal [49]. However, $L_1$ regularization may cause biased
estimation for large coefficients since it over-penalizes true large
coefficients [158].

As an extension of sparse approximation, the recovery of a data matrix from a
sampling of its entries is considered in [29]. It is proved that a matrix
$X \in R^{m \times n}$ of rank $r$, $r < \min(m,n)$, can be perfectly
recovered from a number $k$ of entries selected uniformly at random from the
matrix with very high probability if $k$ obeys a certain condition [29]. The
matrix completion problem is formulated as finding the matrix with minimum
nuclear norm that fits the data. This can be solved using iterative
singular-value thresholding [27].


LASSO and greedy pursuit 


The convex minimization problem given by (2.51) or (2.48) can be cast as an LS
problem with $L_1$ penalty, also referred to as LASSO [139]

$$\hat{x} = \arg\min_{\tilde{x} \in R^N} \left\{ \|A\tilde{x} - y\|_2^2 + \lambda \|\tilde{x}\|_1 \right\} \qquad (2.52)$$

with regularization parameter $\lambda$. Public domain software packages exist
to solve problem (2.52) efficiently.
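
One simple way to see how (2.52) can be minimized is the iterative shrinkage-thresholding algorithm (ISTA), sketched below. ISTA is a swapped-in textbook method chosen for brevity; it is not one of the software packages mentioned above, and the problem sizes, the value of $\lambda$, and the iteration count are hypothetical. Each step takes a gradient step on $\|A\tilde{x} - y\|_2^2$ and then applies soft thresholding, which is what produces exact zeros.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(A, y, x, lam):
    """Value of ||Ax - y||_2^2 + lam * ||x||_1, as in (2.52)."""
    return np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def lasso_ista(A, y, lam, n_iter=500):
    """Minimize the LASSO objective with ISTA, using step size 1/L."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)       # gradient of the squared residual
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Hypothetical test problem: a 3-sparse vector observed through 40 measurements.
rng = np.random.default_rng(0)
M, N = 40, 100
A = rng.standard_normal((M, N))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns
x_true = np.zeros(N)
x_true[[5, 37, 80]] = [1.5, -2.0, 1.0]
y = A @ x_true

x_hat = lasso_ista(A, y, lam=0.05)
print("objective:", lasso_objective(A, y, x_hat, 0.05))
print("nonzeros :", np.count_nonzero(x_hat))
```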

LASSO is probably the most popular supervised-learning technique that has 
been proposed to recover sparse signals from high-dimensional measurements. 
LASSO shrinks certain regression coefficients to zero, giving interpretable models 
that are sparse. It minimizes the sum of squared errors, given a fixed bound on
the sum of absolute values of the regression coefficients. LASSO and many
$L_1$-regularized regression methods typically need to set a regularization parameter.

LASSO solves a robust optimization problem. The sparsity and consistency 
of LASSO are shown based on its robustness interpretation [153]. Furthermore, 
the robust optimization formulation is shown to be related to kernel density
estimation. A no-free-lunch theorem is proved in [153] which states that sparsity 
and algorithmic stability contradict each other, and hence LASSO is not stable. 
An asymptotic analysis shows that the asymptotic variances of some of the robust 
versions of LASSO estimators are stabilized in the presence of large variance 
noise, compared with the unbounded asymptotic variance of the ordinary LASSO 
estimator [37]. 

The LS ridge regression model estimates linear regression coefficients, with
the $L_2$ ridge regularization on coefficients. To better identify important
features in the data, LASSO uses the $L_1$ penalty instead of using the $L_2$
ridge regularization. LASSO, or $L_1$-regularized LS, has been explored
extensively for its remarkable sparsity properties. For sparse,
high-dimensional regression problems, marginal regression, where each
dependent variable is regressed separately on each covariate, computes the
estimates roughly two orders of magnitude faster than the LASSO solutions [60].

A greedy algorithm is usually used for solving the convex minimization prob- 
lem given by (2.51) or (2.48). Basis pursuit [36] is a greedy sparse approximation 
technique for decomposing a signal into a superposition of dictionary elements 
(basis functions), which has the smallest Lı norm of coefficients among all such 
decompositions. It is implemented as pdco and SolveBP in the SparseLab toolbox
(http://sparselab.stanford.edu). A similar method [53] is implemented
as SolveLasso in the SparseLab toolbox.

Orthogonal matching pursuit (OMP) [102] is the simplest effective greedy algo- 
rithm for sparse approximation. At each iteration, a column of A that is max- 
imally correlated with the residual is chosen, the index of this column is added 
to the list, and then the vestige of columns in the list is eliminated from the 
measurements, generating a new residual for the next iteration. OMP adds one 
new element of the dictionary and makes one orthogonal projection at each iter- 
ation. Generalized OMP [149] generalizes OMP by identifying multiple indices 
per iteration. Similarly, the orthogonal super greedy algorithm [87] adds multi- 
ple new elements of the dictionary and makes one orthogonal projection at each 
iteration. The performance of orthogonal multimatching pursuit, a counterpart 
of the orthogonal super greedy algorithm in the compressed sensing setting, is 
analyzed in [87] under RIP conditions. 
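
The OMP iteration described above can be sketched in a few lines. This is a minimal illustration that assumes unit-norm columns and a known sparsity level $K$ as the stopping rule; the function name and the test problem are hypothetical, and no claim is made that it matches the implementation in [102].

```python
import numpy as np

def omp(A, y, K):
    """Orthogonal matching pursuit for y ~ A x with at most K nonzeros.

    Assumes the columns of A have unit Euclidean norm.
    """
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0      # never select a column twice
        support.append(int(np.argmax(correlations)))
        # Orthogonal projection of y onto the span of the selected columns.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x, support

# Hypothetical test problem with a 3-sparse ground truth.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [1.5, -2.0, 1.0]
x_hat, support = omp(A, A @ x_true, K=3)
print(sorted(support), np.linalg.norm(x_hat - x_true))
```

The least-squares refit over all selected columns is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to every column chosen so far.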

In order to solve the sparse approximation problem, a single sufficient condi- 
tion is developed in [141] under which both OMP and basis pursuit can recover 
an exactly sparse signal. For every input signal, OMP can calculate a sparse 
approximant whose error is only a small factor worse than the optimal error 
which can be attained with the same number of terms [141]. OMP can reliably 
recover a signal with K nonzero entries in dimension N given O(K In N) random 
linear measurements of that signal [142]. 

A sparse LMS algorithm [38] takes advantage of the sparsity of the underly- 
ing signal for system identification. This is done by incorporating two sparsity 
constraints into the quadratic cost function of the LMS algorithm. A recursive 
$L_1$-regularized LS algorithm is developed in [7] for the estimation of a
sparse tap-weight vector in the adaptive filtering setting. It exploits noisy
observations of the tap-weight vector output stream and produces its estimate
using an EM-type algorithm. Recursive $L_1$-regularized LS converges to a
near-optimal estimate in a stationary environment. It significantly improves
on the RLS algorithm in terms of MSE, with lower computational requirements.


Bibliographical notes 


Some good books on machine and statistical learning are Duda et al. (2000) 
[51], Bishop (1995) [22], Bishop (2006) [24], Ripley (1996) [112], Cherkassky and 
Mulier (2007) [40], Vapnik (1998) [148], and Hastie et al. (2005) [68]. 


Functional data analysis 

Functional data analysis is an extension of the traditional data analysis to func- 
tional data [108]. Functional data analysis characterizes a series of data points as 
a single piece of data. Examples of functional data are spectra, temporal series, 
and spatiotemporal images. Functional data are usually represented by regular 
or irregular sampling as lists of input-output pairs. Functional data analysis is
closely related to multivariate statistics and regularization. Many statistical
methods, such as PCA, multivariate linear modeling and CCA, can be applied
within this framework. Conventional neural network models have been extended 
to functional data inputs, such as the RBF network [116], the MLP [117], SVMs 
[118], and k-NN method [21]. 


Parametric, semiparametric and nonparametric classification 
Pattern classification techniques with numerical inputs can be generally classified 
into parametric, semiparametric and nonparametric groups. The parametric and
semiparametric classifiers need a certain amount of a priori information about the
structure of the data in the training set. Parametric techniques assume that the
form of the pdf is known in advance except for a vector of parameters, which 
has to be estimated from the sample of realizations. In this case, smaller sample 
size can yield good performance if the form of the pdf is properly selected. When 
some insights about the form of the pdf are available, parametric techniques 
offer the most valid and efficient approach to density estimation. Semiparametric 
techniques consider models having a number of parameters not growing with the 
sample size, though greater than that involved in parametric techniques. 
Nonparametric techniques aim to retrieve the behavior of the pdf without 
imposing any a priori assumption on it; therefore, they require a sample size 
significantly higher than the dimension of the domain of the random variable. 
Density estimation methods using neural networks or SVMs fall into the
category of nonparametric techniques. The Parzen window approach [101] is a
nonparametric method for estimating the pdf of a finite set of patterns; it
has a very high computational cost due to the very large number of kernels
required for its representation. A decision tree such as C5.0
(http://www.rulequest.com/see5-info.html) is an efficient nonparametric
method. A decision tree is a hierarchical data structure implementing the
divide-and-conquer strategy. It is a supervised learning method, and can be
used for both classification and regression.


Complexity of circuits 

Complexity theory of circuits strongly suggests that deep architectures can 
be much more efficient than their shallow counterparts, in terms of computa- 
tional elements and parameters required to represent some functions. Theoreti- 
cal results on circuit complexity theory have shown that shallow digital circuits 
can be exponentially less efficient than deeper ones [66]. An equivalent result has 
been proved for architectures whose computational elements are linear thresh- 
old units [67]. Any Boolean function can be represented by a two-layer circuit 
of logic gates. However, most Boolean functions require an exponential number 
of logic gates (with respect to the input size) to be represented by a two-layer 
circuit. For example, the parity function, which can be efficiently represented by
a circuit of depth $O(\log n)$ (for $n$ input bits), needs $O(2^n)$ gates to be
represented by a depth-two circuit [156].


Categorical data 

Categorical data can generally be classified into ordinal data and nominal data. 
Ordinal and nominal data both have a set of possible states, and the value of a 
variable will be in one of those possible states. The difference between them is 
that the states in ordinal data are ordered but are unordered in nominal data. A 
nominal variable can only have two matching results, either match or does not 
match. For instance, hair color is a nominal variable that may have four states: 
black, blond, red, and brown. Service quality assessment is an ordinal variable 
that may have five states: very good, good, medium, poor, very poor. 

Ordinal regression is generally defined as the task where some input sample 
vectors are ranked on an ordinal scale. Ordinal regression is commonly formu- 
lated as a multiclass problem with ordinal constraints. The aim is to predict 
variables of ordinal scale. In contrast to traditional metric regression problems, 
these ranks are of finite types and the metric distances between the ranks are not 
defined. The naive idea is to transform the ordinal scales into numerical values 
and then solve the problem as a standard regression problem. 


Occam’s razor 

A widely accepted interpretation of Occam’s razor is: “Given two classifiers with 
the same training error, the simpler classifier is more likely to generalize better”. 
Domingos [47] rejects this interpretation and proposes that model complexity is 
only a confounding factor usually correlated with the number of models from 
which the learner selects. It is thus hypothesized that the risk of overfitting 
(poor generalization) follows only from the number of model tests rather than
the complexity of the selected model. The confusion between the two factors
arises from the fact that a learning algorithm usually conducts a greater amount
of testing to fit a more complex model. Experimental results on real-life datasets
confirm Domingos’ hypothesis [157]. In particular, the experiments test the
following assertions. (i) Models selected from a larger set of tested candidate models
overfit more than those selected from a smaller set (assuming constant model 
complexity). (ii) More complex models overfit more than simpler models (assum- 
ing a constant number of candidate models tested). According to Domingos’ 
hypothesis, the first assertion should be true and the second should be false. 


Learning from imbalanced data 

Learning from imbalanced data is a challenging problem. It is the problem of 
learning a classification rule from data that are skewed in favor of one class. 
Many real-world data sets are imbalanced and the majority class has many
more training patterns than the minority class. The resultant hyperplane will
be shifted towards the majority class. However, the minority class is often the 
most interesting one for the task. 

For the imbalanced data sets, a classifier may fail. The remedies can be divided 
into two categories. The first category processes the data before feeding them 
into the classifier, such as the oversampling and undersampling techniques, com- 
bining oversampling with undersampling, and synthetic minority oversampling 
technique (SMOTE) [34]. The oversampling technique duplicates the positive 
data by interpolation while undersampling technique removes the redundant 
negative data to reduce the imbalanced ratio. They are classifier-independent 
approaches. The second category belongs to the algorithm-based approach such 
as different error cost algorithms [85], and class-boundary-alignment algorithm 
[152]. The different cost algorithms suggest that by assigning heavier penalty to 
the smaller class, the skew of the optimal separating hyperplane can be corrected. 
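
The two families of remedies can be sketched as follows. The first function is plain random oversampling by duplication, a deliberately simpler stand-in for SMOTE [34], which interpolates new samples instead. The second computes inverse-frequency misclassification costs as one possible way to assign a heavier penalty to the smaller class. Both function names are hypothetical, and integer class labels are assumed.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate randomly chosen samples until every class matches the largest."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

def inverse_frequency_costs(y):
    """Per-class cost proportional to 1 / class frequency (integer labels assumed)."""
    classes, counts = np.unique(y, return_counts=True)
    return {int(c): len(y) / (len(classes) * n) for c, n in zip(classes, counts)}

# Hypothetical toy data: 8 majority samples (label 0), 2 minority samples (label 1).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X, y, np.random.default_rng(0))
costs = inverse_frequency_costs(y)
print(np.bincount(y_bal), costs)
```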


2.1 A distance measure, or metric, between two points, must satisfy three
conditions:

• Positivity: $d(x, y) \ge 0$ and $d(x, y) = 0$ if and only if $x = y$;

• Symmetry: $d(x, y) = d(y, x)$;

• Triangle inequality: $d(x, y) + d(y, z) \ge d(x, z)$.

a) Show that the Euclidean distance, the city block distance, and the maximum 
value distance are metrics. 


b) Show that the squared Euclidean distance is not a metric. 
c) How about the Hamming distance? 


2.2 Is it possible to use a single neuron to approximate the function $f(x) = x^2$?
2.3 Is the following set of points linearly separable?
Class 1: (0,0,0,0), (1,0,0,1), (0,1,0,1); class 2: (1,1,1,1), (1,1,0,0), (1,0,1,0). 


2.4 For a K-class problem, the target $t_k$, $k = 1, \ldots, K$, is a vector
of all zeros but for a one in the kth position. Show that classifying a
pattern to the largest element of $\hat{y}$, if $\hat{y}$ is normalized, is
equivalent to choosing the closest target, $\min_k \|t_k - \hat{y}\|$.


2.5 Given a set of $N$ samples $(x_p, t_p)$, $p = 1, \ldots, N$, derive the optimal least
squares parameter $w$ for the total training loss


Compare the expression with that derived from the average loss. 


2.6 Assume that we have a class of functions $\{f(x, \alpha)\}$ indexed by a
parameter vector $\alpha$, with $x \in R^p$, $f$ being an indicator function,
taking value 0 or 1. If $\alpha = (\alpha_0, \alpha_1)$ and $f$ is the linear
indicator function $I(\alpha_0 + \alpha_1^T x > 0)$, then the complexity of
the class $f$ is the number of parameters $p + 1$.

The indicator function $I(\sin(\alpha x) > 0)$ can shatter (separate) an
arbitrarily large number of points by choosing an appropriately high frequency
$\alpha$. Show that the set of functions $\{I(\sin(\alpha x) > 0)\}$ can
shatter the following points on the line: $x_1 = 2^{-1}, \ldots, x_M = 2^{-M}$,
$\forall M$. Hence the VC dimension of the class $\{I(\sin(\alpha x) > 0)\}$
is infinite.


2.7 For an input vector $x$ with $p$ components and a target $y$, the
projection pursuit regression model has the form

$$f(x) = \sum_{m=1}^{M} g_m(w_m^T x),$$

where $w_m$, $m = 1, 2, \ldots, M$, are unit p-vectors of unknown parameters.
The functions $g_m$ are estimated along with the direction $w_m$ using some
flexible smoothing method. Neural networks are just nonlinear statistical
models. Show how neural networks resemble the projection pursuit regression
model.


2.8 The XOR operation is a linearly inseparable problem. Show that a
quadratic threshold gate

$$y = \begin{cases} 1, & \text{if } g(x) = \sum_{i} w_i x_i + \sum_{i,j} w_{ij} x_i x_j \ge T \\ 0, & \text{otherwise} \end{cases}$$

can be used to separate them. Give an example for $g(x)$, and plot the
separating surface.


2.9 Plot four points that are in general position. Show how they are separated 
by separating lines. 
3.1


Perceptrons 


One-neuron perceptron 


The perceptron [38], also referred to as a McCulloch-Pitts neuron or linear 
threshold gate, is the earliest and simplest neural network model. Rosenblatt 
used a single-layer perceptron for the classification of linearly separable patterns. 

For a one-neuron perceptron, the network topology is shown in Fig. 1.2, and
the net input to the neuron is given by

$$net = \sum_{i} w_i x_i - \theta = w^T x - \theta, \qquad (3.1)$$

where all the symbols are as explained in Sect. 1.2. The one-neuron perceptron
using the hard-limiter activation function is useful for classification of
vector $x$ into two classes. The two decision regions are separated by a
hyperplane

$$w^T x - \theta = 0, \qquad (3.2)$$

where the threshold $\theta$ is a parameter used to shift the decision
boundary away from the origin.

The three popular activation functions are the hard limiter (threshold)
function,

$$\phi(x) = \begin{cases} 1, & x \ge 0 \\ -1 \ (\text{or } 0), & x < 0 \end{cases}, \qquad (3.3)$$

the logistic function

$$\phi(x) = \frac{1}{1 + e^{-\beta x}}, \qquad (3.4)$$

and the hyperbolic tangent function

$$\phi(x) = \tanh(\beta x). \qquad (3.5)$$

In these functions, $\beta$ is a gain, typically selected as unity, and is
used to control the steepness of the activation function. These activation
functions are illustrated in Fig. 3.1.
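
The three activation functions translate directly into code. The sketch below is a minimal NumPy version with the gain $\beta$ as a parameter; the function names and the `low` argument (selecting $-1$ or $0$ as the lower output of the hard limiter) are naming choices made for this example.

```python
import numpy as np

def hard_limiter(x, low=-1.0):
    """Threshold function: 1 for x >= 0, otherwise `low` (-1 or 0)."""
    return np.where(np.asarray(x, dtype=float) >= 0, 1.0, low)

def logistic(x, beta=1.0):
    """Logistic function with gain beta; output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-beta * np.asarray(x, dtype=float)))

def tanh_activation(x, beta=1.0):
    """Hyperbolic tangent with gain beta; output in (-1, 1)."""
    return np.tanh(beta * np.asarray(x, dtype=float))

grid = np.linspace(-3.0, 3.0, 7)
print(hard_limiter(grid))
print(logistic(grid, beta=2.0))     # a larger gain gives a steeper transition
print(tanh_activation(grid))
```

The logistic and hyperbolic tangent functions differ only by scaling and shifting, since $\tanh(x) = 2/(1 + e^{-2x}) - 1$.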

All the above functions are monotonically increasing with the output range
$(-1, 1)$ or $(0, 1)$. Sigmoidal functions are usually defined as those
monotonically increasing functions satisfying
$\lim_{x \to +\infty} \phi(x) = 1$, $\lim_{x \to -\infty} \phi(x) = 0$. Many
functions satisfy this definition if stretched out, and they can be treated as
sigmoidal functions. Many other sigmoidal activation functions are introduced
in [13].

Figure 3.1 Sigmoidal activation functions.


3.2


A biologically more plausible perceptron is presented in [40] based on the 
integrate-and-fire model, with the derived learning rule which enables training of 
the neuron on nonlinear tasks. The model encodes the mean interspike interval, 
refractory period and voltage threshold. It is possible to train such a neuron 
model by seeking to minimize the output error, and derive a learning rule from 
the mean interspike interval of the neuron’s output. 


Single-layer perceptron 


When more neurons with the hard-limiter activation function are used, we have
a single-layer perceptron, as shown in Fig. 3.2. The single-layer perceptron
can be used to classify input vector data $x$ into more classes. For a
$J_1$-$J_2$ perceptron, the system state is updated by

$$net = W^T x - \theta, \qquad (3.6)$$

$$\hat{y} = \phi(net), \qquad (3.7)$$

where the net input vector $net = (net_1, \ldots, net_{J_2})^T$, the output
vector $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_{J_2})^T$,
$\theta = (\theta_1, \ldots, \theta_{J_2})^T$ corresponds to all the biases in
the second layer, and
$\phi(net) = (\phi_1(net_1), \ldots, \phi_{J_2}(net_{J_2}))^T$ corresponds to
all the activation functions of the neurons.

The problem of finding the weights of a single sigmoidal neuron that minimize 
the quadratic training error proves to be NP-hard [42]. The adaptation of W is 
error driven, which can be according to Rosenblatt’s perceptron learning algo- 
rithm [38, 39] or according to the LMS algorithm based on the adaline model 
[45]. 
Figure 3.2 Architecture of the single-layer perceptron.


3.3 


Perceptron learning algorithm 


Rosenblatt proved the perceptron convergence theorem for classification prob- 
lems [39]. 


Theorem 3.1 (Perceptron convergence). Given a one-neuron perceptron
and input patterns $x \in X$ from two linearly separable classes. Let the patterns
be presented in an arbitrary sequence in each epoch. Then, starting from an
arbitrary initial state, the perceptron learning procedure always converges and
yields a decision hyperplane between the two classes in finite time.


From the perceptron convergence theorem, the weights of the perceptron will 
converge to a fixed point within a finite number of updates for a set of lin- 
early separable input patterns. The perceptron convergence theorem has been 
extended for the MLP, stating that the pattern mode BP algorithm converges 
to an optimal solution for linearly separable patterns with no upper bound on 
the learning rate [22]. 

The perceptron convergence theorem can be proved by minimizing the follow- 
ing perceptron criterion function using the gradient-descent method: 


$$E(\mathbf{w}) = \sum_{\mathbf{x} \in \mathcal{X}} \left(-\mathbf{w}^T \mathbf{x}\right), \tag{3.8}$$

where $\mathcal{X}$ is the set of samples misclassified by $\mathbf{w}$. Thus, the weights are modified 
in such a manner as to reduce the number of misclassifications. The perceptron 
convergence theorem can be easily extended to the single-layer perceptron by 
extending the perceptron learning algorithm from one neuron to multiple neu- 
rons. 
The perceptron learning algorithm is given as 

$$net_{t,j} = \sum_{i=1}^{J_1} x_{t,i} w_{ij}(t) - \theta_j = \mathbf{w}_j^T \mathbf{x}_t - \theta_j, \tag{3.9}$$

$$\hat{y}_{t,j} = \begin{cases} 1, & net_{t,j} > 0 \\ 0, & \text{otherwise} \end{cases}, \tag{3.10}$$

$$e_{t,j} = y_{t,j} - \hat{y}_{t,j}, \tag{3.11}$$

$$w_{ij}(t+1) = w_{ij}(t) + \eta x_{t,i} e_{t,j}, \tag{3.12}$$

for $i = 1, \ldots, J_1$, $j = 1, \ldots, J_2$, where $net_{t,j}$ is the net input of the $j$th neuron for 
the $t$th example, $\mathbf{w}_j = (w_{1j}, w_{2j}, \ldots, w_{J_1 j})^T$ is the vector collecting all weights 
terminated at the $j$th neuron, $\theta_j$ is the threshold for the $j$th neuron, $x_{t,i}$ is the 
$i$th input of the $t$th example, $\hat{y}_{t,j}$ and $y_{t,j}$ are, respectively, the network output 
and the desired output of the $j$th neuron for the $t$th example, with value 0 
or 1 representing class membership, and $\eta$ is the learning rate. All the weights 
$w_{ij}$ are randomly initialized. The selection of $\eta$ does not affect the stability of 
perceptron learning, and affects the convergence speed only for a nonzero initial 
weight vector. $\eta$ is typically selected as 0.5. The learning process stops when the 
errors are sufficiently small. 

Figure 3.3 Use perceptron learning for classification. 
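
The update rules (3.9)-(3.12) can be sketched in a few lines of Python for a single output neuron. This is a minimal illustration rather than the implementation behind Fig. 3.3: the names `train_perceptron` and `predict`, the epoch limit, and the random seed are assumptions introduced here, and the threshold is adapted like a weight attached to a constant input of -1.

```python
import random

def train_perceptron(samples, labels, eta=0.5, max_epochs=1000, seed=0):
    """Single-neuron perceptron learning following (3.9)-(3.12).

    samples: list of input vectors; labels: desired outputs in {0, 1}.
    Returns (w, theta, epochs_used).
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in samples[0]]   # random initial weights in [0, 1)
    theta = rng.random()                     # random initial threshold
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, y in zip(samples, labels):
            net = sum(wi * xi for wi, xi in zip(w, x)) - theta    # (3.9)
            y_hat = 1 if net > 0 else 0                           # (3.10)
            e = y - y_hat                                         # (3.11)
            if e != 0:
                errors += 1
                w = [wi + eta * xi * e for wi, xi in zip(w, x)]   # (3.12)
                theta -= eta * e   # threshold as a weight on the input -1
        if errors == 0:            # a full epoch without mistakes: stop
            return w, theta, epoch
    return w, theta, max_epochs

def predict(w, theta, x):
    """Hard-limiter output for one input vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta > 0 else 0
```

Calling `train_perceptron` on the four patterns of Example 3.1 with labels `[0, 0, 1, 1]` yields a separating line, although the learned coefficients depend on the random initialization and need not match those quoted in the example.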


Example 3.1: For a classification problem, the inputs (2, 2), (−2, 2) are in class 
0, and (1, −2), (−1, 1) are in class 1. Select the initial weights and bias as random 
numbers between 0 and 1. After training for one epoch, the algorithm converges. 
The result is illustrated in Fig. 3.3. In the figure, the learned class boundary 
is $\mathbf{w}^T \mathbf{x} - \theta = 0.9294x_1 + 0.7757x_2 + 0.4868 = 0$. Randomly generate 10 points, 
and the learned perceptron can correctly classify them. Training can be 
implemented in adaptive learning mode. 


When used for classification, perceptron learning can operate only for linearly 
separable patterns, and does not terminate for linearly inseparable patterns. The 
failure of Rosenblatt’s and similar methods to converge for linearly inseparable 
problems is caused by the inability of the methods to detect the minimum of the 
error function [15]. 

For a set of nonlinearly separable input patterns, the obtained weights of 
a perceptron may exhibit a limit cycle behavior. A perceptron exhibiting the 
limit cycle behavior is actually a neural network with time periodically varying 
coefficients. The minimum number of updates for the weights of the perceptron to 
reach the limit cycle depends on the initial weights. The boundedness condition of 
the perceptron weights is independent of the initial weights [25]. Also, a necessary 
and sufficient condition for the weights of the perceptron exhibiting a limit cycle 
behavior is derived, and the range of the number of updates for the weights of 
the perceptron required to reach the limit cycle is estimated in [25]. In [26], an 
invariant set of the weights of the perceptron trained by the perceptron training 
algorithm is defined and characterized. The dynamic range of the steady-state 
values of the weights can be evaluated by finding the dynamic range of the 
weights inside the largest invariant set. 

The pocket algorithm [20] improves on perceptron learning by adding a check- 
ing amendment to stop the algorithm; it optimally dichotomizes the given pat- 
terns in the sense of minimizing the erroneous classification rate. It can be applied 
for the classification of linearly inseparable patterns. The weight vector with the 
longest unchanged run is identified as the best solution so far and is stored in 
the pocket. The content of the pocket is replaced by any new weight vector with 
a longer successful run. The pocket convergence theorem guarantees the optimal 
convergence of the pocket algorithm, if the inputs in the training set are inte- 
gers or rational [20, 35]. The pocket algorithm with ratchet [20] evaluates the 
hypotheses on the entire training set and picks the best; it is asserted to find an 
optimal weight vector with probability one within a finite number of iterations, 
independently of the given training set [35]. 
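
The pocket idea with ratchet can be sketched as follows. This is a hedged illustration under stated assumptions, not the procedure of [20] verbatim: patterns are drawn at random, the bias is stored as the last weight and added to the net input (rather than subtracted as a threshold), and the names `misclassified` and `pocket_with_ratchet` are introduced here. The pocket is replaced only when the current weights have a longer run of correct classifications and, as the ratchet check, fewer errors on the whole training set.

```python
import random

def misclassified(w, samples, labels):
    """Count training errors; the last entry of w is the bias."""
    count = 0
    for x, y in zip(samples, labels):
        net = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
        if (1 if net > 0 else 0) != y:
            count += 1
    return count

def pocket_with_ratchet(samples, labels, eta=0.5, iterations=5000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (len(samples[0]) + 1)
    pocket, pocket_errors = list(w), misclassified(w, samples, labels)
    pocket_run, run = 0, 0
    for _ in range(iterations):
        k = rng.randrange(len(samples))          # random pattern presentation
        x, y = samples[k], labels[k]
        net = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
        e = y - (1 if net > 0 else 0)
        if e == 0:
            run += 1                             # current weights survive
            if run > pocket_run:
                errors = misclassified(w, samples, labels)   # ratchet check
                if errors < pocket_errors:
                    pocket, pocket_errors, pocket_run = list(w), errors, run
                    if errors == 0:
                        break
        else:
            # ordinary perceptron update; zip stops before the bias entry
            w = [wi + eta * e * xi for wi, xi in zip(w, x)] + [w[-1] + eta * e]
            run = 0
    return pocket, pocket_errors
```

On a linearly inseparable set such as XOR the perceptron weights keep changing, but the returned pocket never gets worse, because it is replaced only by weights with strictly fewer training errors.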

Thermal perceptron learning [18] is obtained by multiplying the second term of 
(3.12) by a temperature annealing factor $e^{-|net_{t,j}|/T}$, where $T$ is an annealing tem- 
perature. It finds stable weights for inseparable problems as well as for separable 
ones. It can be applied for the classification of linearly inseparable patterns. 


Least mean squares (LMS) algorithm 


The LMS algorithm [45] achieves a robust separation between the patterns of 
different classes by minimizing the MSE rather than the number of misclassified 
patterns through the gradient-descent method. Like perceptron learning, it can 
only be used for the classification of linearly separable patterns. In the LMS 
algorithm, the activation function is linear, and the error is defined by 


$$e_{t,j} = y_{t,j} - net_{t,j}, \tag{3.13}$$

where $net_{t,j}$ is defined by (3.9). The weight update rule is the same as (3.12), 
and is reproduced here 

$$w_{ij}(t+1) = w_{ij}(t) + \eta x_{t,i} e_{t,j}. \tag{3.14}$$

For classification problems, a threshold activation function is further applied 
to the linear output so as to render the final output to {0, 1} or {+1, −1} 

$$\hat{y}_{t,j} = \begin{cases} 1, & net_{t,j} > 0 \\ 0, & \text{otherwise} \end{cases}. \tag{3.15}$$


The whole unit including a linear combiner and the following threshold operation 
is called an adaptive linear element (adaline) [45]. The above LMS rule is also 
called the $\mu$-LMS rule. For practical purposes, $\eta$ can be selected as 
$0 < \eta < \frac{2}{\max_t \|\mathbf{x}_t\|^2}$ to ensure its convergence. 

The Widrow-Hoff delta rule, known as the $\alpha$-LMS, is a modification to the 
LMS rule obtained by normalizing the input vector so that the weights change 
independently of the magnitude of the input vector [46] 

$$w_{ij}(t+1) = w_{ij}(t) + \eta \frac{e_{t,j} x_{t,i}}{\|\mathbf{x}_t\|^2}. \tag{3.16}$$

For the convergence of the $\alpha$-LMS rule, $\eta$ should be selected as $0 < \eta < 2$, and a 
practical range for $\eta$ is $0.1 < \eta < 1.0$ [46]. Unlike perceptron learning, the LMS 
method can also be used for function approximation. In this case, the threshold 
operation in the adaline is dropped, and the behavior of the adaline is identical 
to that of linear regression. 
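
As a minimal sketch of the two rules, the following Python function applies (3.13)-(3.14) pattern by pattern and switches to the normalized update (3.16) on request. The function name `lms_train`, the zero initialization, and the default parameters are assumptions made here. For the normalized variant the squared norm is taken over the augmented input (x, -1), a choice made in this sketch so that the divisor is never zero.

```python
def lms_train(samples, targets, eta=0.05, epochs=500, normalized=False):
    """LMS training of one linear unit with output net = w.x - theta.

    normalized=False: plain LMS update (3.14).
    normalized=True:  alpha-LMS update (3.16), step divided by ||x||^2.
    """
    w, theta = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            net = sum(wi * xi for wi, xi in zip(w, x)) - theta   # (3.9)
            e = y - net                                          # (3.13)
            step = eta
            if normalized:
                # squared norm of the augmented input (x, -1)
                step = eta / (sum(xi * xi for xi in x) + 1.0)
            w = [wi + step * xi * e for wi, xi in zip(w, x)]     # (3.14)/(3.16)
            theta -= step * e    # threshold as a weight on the input -1
    return w, theta
```

For function approximation the linear output is used directly; for classification a threshold as in (3.15) would be applied to `net` after training.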

There are also madaline models using layered multiple adalines [46]. Madaline 
still cannot solve linearly inseparable problems, since the adaline network is a 
linear neural network and consecutive layers can be simplified to a single layer 
by multiplying the respective weight matrices. The Widrow-Hoff delta rule has 
become the foundation of modern adaptive signal processing. A complex LMS is 
given in [6]. 


Example 3.2: For a classification problem, the inputs (1, 2), (−2, 1) are in class 
0, and (1, −1), (−1, 0) are in class 1. Use the initial weights and bias as random 
numbers between 0 and 1. After training for one epoch, the algorithm converges. 
The result is illustrated in Fig. 3.4. In the figure, the learned class boundary 
is $\mathbf{w}^T \mathbf{x} - \theta = 0.0447x_1 - 0.3950x_2 + 0.7080 = 0$. Randomly generate 50 points, 
and the learned linear model can correctly classify them. Training is implemented 
in adaptive learning mode. In the model, a threshold function is applied for 
classification. It classifies a pattern into class 0 if the output is less than 0.5, or 
class 1 otherwise. Notice that the learned linear model optimizes the MSE, but 
not the classification accuracy. 


Example 3.3: We use the linear network to approximate $f(x) = 20e^{-0.3x} \sin x + N(0, 1)$, 
where $N(0, 1)$ is Gaussian noise with zero mean and variance 1. The result 
is illustrated in Fig. 3.5. In the figure, the learned linear model is 
$y = \mathbf{w}^T \mathbf{x} - \theta = -0.2923x + 3.7344$. The linear model achieves optimum approximation in 
terms of MSE. 

Figure 3.4 Use of the LMS algorithm for classification. (a) The process of learning LMS boundary. (b) 
Classification result. (c) The change of the weights and bias. (d) The evolution of the MSE. 

Figure 3.5 Use of the linear model for regression. 

P-delta rule 


The perceptron algorithm using margins [29] attempts to establish guarantees 
on the separation margin during the training process. In most cases, similar 
performance is obtained by the voted-perceptron, which has the advantage that 
it does not require parameter selection. Techniques using soft margin ideas are 
run-time intensive and do not give additional performance benefits [27]. In terms 
of run time, the voted perceptron does not require parameter selection and can 
therefore be faster to train. Both the voted perceptron and the margin variant 
reduce the deviation in accuracy in addition to improving the accuracy [27]. 

The voted perceptron [19] assigns each vector a vote based on the number of 
sequential correct classifications by that weight vector. Whenever an example is 
misclassified, the voted perceptron records the number of correct classifications 
made since the previous misclassification, assigns this number to the current 
weight vector’s vote, saves the current weight vector, and then updates as normal. 
After training, all the saved vectors are used to classify future examples and 
their classifications are combined using their votes. When the data are linearly 
separable and given enough iterations, both these variants will converge to a 
hypothesis that is very close to the simple perceptron algorithm. 
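
The bookkeeping described above can be sketched as follows. This is an illustrative reading of the description, not the reference implementation of [19]: labels are taken in {-1, +1}, the bias is an extra additive term, a new weight vector starts with a vote of zero, and the names `train_voted_perceptron` and `predict_voted` are assumptions introduced here.

```python
def train_voted_perceptron(samples, labels, epochs=10):
    """Return a list of (weights, bias, vote); labels are in {-1, +1}."""
    w, b, vote = [0.0] * len(samples[0]), 0.0, 0
    saved = []
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            net = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * net > 0:
                vote += 1                     # one more correct classification
            else:
                saved.append((w, b, vote))    # store the vector with its vote
                w = [wi + y * xi for wi, xi in zip(w, x)]   # normal update
                b += y
                vote = 0
    saved.append((w, b, vote))                # keep the final vector as well
    return saved

def predict_voted(saved, x):
    """Combine the classifications of all saved vectors using their votes."""
    total = 0
    for w, b, vote in saved:
        net = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += vote if net > 0 else -vote
    return 1 if total > 0 else -1
```

With linearly separable data and enough epochs, the last saved vector accumulates most of the votes, which is why the combined prediction approaches that of the simple perceptron.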

A single layer of perceptrons, known as a parallel perceptron, can compute any 
Boolean function if the majority vote of its perceptrons is viewed as the binary 
output of the circuit, and parallel perceptrons are universal 
approximators for arbitrary continuous functions with values in [0,1] if one 
applies a simple squashing function to the percentage of votes with value 1 [4]. 
The parallel perceptron has just binary values as outputs of gates on the hidden 
layer, implementing a soft-winner-take-all gate. These extremely simple neural 
networks are also known as committee machines. 

The parallel delta (p-delta) rule is a simple learning algorithm for parallel 
perceptrons [4]. It has to tune a single layer of weights, and it does not require 
the computation and communication of analog values with high precision. These 
features make the p-delta rule attractive as a biologically more realistic alterna- 
tive to BP. The p-delta rule also implements gradient descent with regard to a 
suitable error measure, although it does not require to compute derivatives. The 
p-delta rule follows a powerful principle from machine learning for committee 
machines: maximization of the margin of individual perceptrons. 

Let $(\mathbf{x}, y) \in \mathbb{R}^d \times [-1, +1]$ be the current training example and let 
$\mathbf{w}_1, \ldots, \mathbf{w}_n \in \mathbb{R}^d$ be the current weight vectors of the $n$ individual perceptrons 
in the parallel perceptron. Thus the current output of the parallel perceptron is 
calculated as 

$$\hat{y} = s_\rho(p), \quad p = \mathrm{card}\{i: \mathbf{w}_i \cdot \mathbf{x} \ge 0\} - \mathrm{card}\{i: \mathbf{w}_i \cdot \mathbf{x} < 0\}, \tag{3.17}$$

where $s_\rho(p)$ is a squashing function analogous to the sigmoidal function, and 
card denotes the size of the set. 
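
For regression problems the squashing function could be piecewise linear, 

$$s_\rho(p) = \begin{cases} -1, & \text{if } p < -\rho \\ p/\rho, & \text{if } -\rho \le p \le \rho \\ +1, & \text{if } p > \rho \end{cases}, \tag{3.18}$$

where $1 \le \rho \le n$ denotes the resolution of the squashing function. 

The p-delta rule is given by [4]. For all $i = 1, \ldots, n$ and accuracy $\epsilon$: 

$$\mathbf{w}_i \leftarrow \mathbf{w}_i + \eta \Delta_i, \tag{3.19}$$

$$\Delta_i = \begin{cases} -\mathbf{x}, & \text{if } \hat{y} > y + \epsilon \text{ and } \mathbf{w}_i \cdot \mathbf{x} \ge 0 \\ +\mathbf{x}, & \text{if } \hat{y} < y - \epsilon \text{ and } \mathbf{w}_i \cdot \mathbf{x} < 0 \\ +\mu \mathbf{x}, & \text{if } \hat{y} \le y + \epsilon \text{ and } 0 \le \mathbf{w}_i \cdot \mathbf{x} < \gamma \\ -\mu \mathbf{x}, & \text{if } \hat{y} \ge y - \epsilon \text{ and } -\gamma < \mathbf{w}_i \cdot \mathbf{x} < 0 \\ 0, & \text{otherwise} \end{cases}, \tag{3.20}$$

$$\mathbf{w}_i \leftarrow \mathbf{w}_i / \|\mathbf{w}_i\|, \tag{3.21}$$

where $\eta$ is the learning rate, $\gamma > 0$ is a margin, and $\mu$, typically selected as 1, measures the importance 
of a clear margin. 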
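
One step of the p-delta rule can be sketched as below, assuming the piecewise-linear squashing function and an additive update of each weight vector by the learning rate times the case-dependent term, followed by normalization. The function names, the default parameter values, and the use of "greater than or equal to zero" as the positive vote are assumptions of this sketch.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def squash(p, rho):
    """Piecewise-linear squashing function: clip p / rho to [-1, 1]."""
    return max(-1.0, min(1.0, p / rho))

def parallel_output(W, x, rho):
    """Output of a parallel perceptron: squashed difference of votes."""
    p = sum(1 if dot(w, x) >= 0 else -1 for w in W)
    return squash(p, rho)

def p_delta_step(W, x, y, rho, eta=0.1, eps=0.05, gamma=0.1, mu=1.0):
    """Apply one p-delta update for the example (x, y); returns new weights."""
    y_hat = parallel_output(W, x, rho)
    new_W = []
    for w in W:
        a = dot(w, x)
        if y_hat > y + eps and a >= 0:
            coef = -1.0            # output too high: push a positive vote down
        elif y_hat < y - eps and a < 0:
            coef = 1.0             # output too low: push a negative vote up
        elif y_hat <= y + eps and 0 <= a < gamma:
            coef = mu              # enlarge the margin of a positive vote
        elif y_hat >= y - eps and -gamma < a < 0:
            coef = -mu             # enlarge the margin of a negative vote
        else:
            coef = 0.0
        w = [wi + eta * coef * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        new_W.append([wi / norm for wi in w] if norm > 0 else w)
    return new_W
```

Only a comparison of the output with the target and the sign and size of each dot product are needed, which reflects the remark above that no derivatives have to be computed.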


Theorem 3.2 (Universal approximation, [4]). Parallel perceptrons are uni- 
versal approximators: Every continuous function $g: \mathbb{R}^d \to [-1, 1]$ can be approx- 
imated by a parallel perceptron within any given error bound $\epsilon$ on any closed and 
bounded subset of $\mathbb{R}^d$. 


Since any Boolean function from $\{0,1\}^d$ into $\{0,1\}$ can be interpolated by a 
continuous function, any Boolean function can be computed by rounding the 
output of a parallel perceptron. 

Parallel perceptrons trained with the p-delta rule provide results comparable 
to those of MLP, madaline, decision tree (C4.5) and SVM, despite their simplicity 
[4]. The p-delta rule can also be applied to biologically realistic integrate-and-fire 
neuron models. It has already been applied successfully to the training of a pool 
of spiking neurons [33]. 

Direct parallel perceptrons [16] use an analytical closed-form expression to 
directly calculate the weights of parallel perceptrons that globally minimize an 
error function measuring simultaneously the classification margin and the train- 
ing error. They have no tunable parameters. They have a computational com- 
plexity linear in the number of patterns and in the input dimension. They are 
tenfold faster than p-delta and two orders of magnitude faster than SVM. They 
also allow online learning. 

Other learning algorithms 


There are many other valid learning rules such as Mays’ rule [34, 46], the Ho- 
Kashyap rule [24, 14], and adaptive Ho-Kashyap rules [23]. The Ho-Kashyap 
algorithm uses the pseudoinverse of the pattern matrix in order to determine 
the solution vector. The adaptive Ho-Kashyap algorithm does not calculate the 
pseudoinverse of the pattern matrix, but the empirical choice of the learning 
parameters is critical for the convergence of the algorithm. Like the perceptron 
learning algorithm, these algorithms converge only in the case of linearly separa- 
ble datasets. The one-shot Hebbian learning [43] and nonlinear Hebbian learning 
[5] have also been used for perceptron learning. 

A single-layer complex-valued neural network [1] solves real-valued classifi- 
cation problems by a gradient-descent learning rule: it maps real input values 
to complex values, processes them in the complex-valued domain, and then 
applies an activation function that maps complex values back to real values. 

Some single-layer perceptron learning algorithms are suitable for both lin- 
early separable and linearly inseparable classification problems. Examples are 
the convex analysis and nonsmooth optimization-based method [15], the linear 
programming (LP) method [14, 32], the constrained steepest-descent algorithm 
[37], fuzzy perceptron [10], and the conjugate-gradient (CG) method [36]. 

The problem of training a single-layer perceptron is to find a solution to a 
set of linear inequalities; thus it is known as an LP problem. LP techniques 
have been applied to single-layer perceptron learning [14, 32]. They can solve 
linearly inseparable problems. When the training vectors are from $\{-1,+1\}^{J_1}$, 
the method requires $O(J_1^3 \log_2 J_1)$ learning cycles in the worst case, while the 
perceptron convergence procedure may require $O(2^{J_1})$ learning cycles [32]. 

The constrained steepest-descent algorithm [37] has no free learning param- 
eters. Learning proceeds by iteratively lowering the perceptron cost function 
following the direction of steepest descent, under the constraint that pat- 
terns already correctly classified are not to be affected. A decrease in the 
error is achieved at each iteration by employing the projection search direction 
when needed. The training task is decomposed into a succession of small-scale 
quadratic programming (QP) problems, whose solutions determine the appropri- 
ately constrained direction of steepest descent. For linearly separable problems, 
it always finds a hyperplane that completely separates the patterns belonging to 
different categories in a finite number of steps. In the case of linearly inseparable 
problems, the algorithm detects the inseparability in a finite number of steps 
and terminates, having usually found a good separation hyperplane. 

The CG algorithm [36] is also used for perceptron learning, where heuristic 
techniques based on reinitialization of the CG method are used. A control-inspired 
approach [12] is applied to the design of iterative steepest descent and CG algo- 
rithms for perceptron training in batch mode, by regarding certain parameters of 
the training algorithm as controls and then using a control Lyapunov technique 
to choose appropriate values of these parameters. 

The shifting perceptron algorithm [8] is a budget algorithm for shifting hyper- 
planes. Shifting bounds for online classification algorithms ensure good perfor- 
mance on any sequence of examples that is well predicted by a sequence of 
changing classifiers. 

Aggressive ROMMA [31] explicitly maximizes the margin on the new exam- 
ple, relative to an approximation of the constraints from previous examples. 
NORMA [28] performs gradient descent on the soft margin risk resulting in 
an algorithm that rescales the old weight vector before the additive update. 
The passive-aggressive algorithm [11] adapts $\eta$ on each example to guarantee 
that it is immediately separable with margin. A second-order perceptron called 
Ballseptron [41] establishes a normalized margin and replaces margin updates 
with updates on hypothetical examples on which a mistake would be made by 
using spectral properties of the data in the updates. ALMA [21] renormalizes 
the weight vector so as to establish a normalized margin. It tunes its parameters 
automatically during the online session. ALMA has p-norm variants that can 
lead to other tradeoffs improving the performance, for example, when the target 
is sparse. 

The use of nonlinear activation functions causes local minima in the objective 
functions based on the MSE criterion. The number of such minima can grow 
exponentially with the input dimension [3]. When using an objective function 
that measures the errors before the neuron’s nonlinear activation function instead 
of after it, for single-layer neural networks, the new convex objective function 
does not contain local minima and the global solution is obtained using a system 
of linear equations [7]. A theoretical analysis of this solution is given in [17], and 
a new set of linear equations, to obtain the optimal weights for the problem, is 
derived. 


Sign-constrained perceptron 
The perceptron learning rule and most existing learning algorithms for linear 
neurons or perceptrons are not true to physiological reality. In these algorithms, 
weights can take values of any sign. However, biological synapses are either exci- 
tatory or inhibitory and usually do not switch between excitation and inhibition. 
This fact is commonly referred to as Dale’s law. In fact, many neurophysiologists 
prefer the assumption that only excitatory synapses are directly used for learn- 
ing, whereas inhibitory synapses are tuned for other tasks. In the latter case, 
one arrives at a perceptron with nonnegative weights as a more realistic model. 
A variation of the perceptron convergence theorem for sign-constrained weights 
was proven in [2]. It tells us that if a sign-constrained perceptron can implement 
a given dichotomy, then it can learn it. 

An analysis of the classification capability of a sign-constrained perceptron is 
given in [30]. In particular, the VC dimension of sign-constrained perceptrons 
is determined, and a necessary and sufficient criterion is provided that tells us 
when all $2^m$ dichotomies over a given set of $m$ patterns can be learned by a sign- 
constrained perceptron. Uniformity of $L_1$ norms of input patterns is a sufficient 
condition for full representation power in the case where all weights are required 
to be nonnegative. Sparse input patterns improve the classification capability of 
sign-constrained perceptrons. The VC dimension is $n + 1$ for an unconstrained 
perceptron with input dimension $n$, while that of sign-constrained perceptrons 
is $n$ [30]. 


3.1 Design a McCulloch-Pitts neuron to recognize the letter “X” digitalized in 
an 8 x 8 array of pixels. 


3.2 Show that the hyperbolic tangent function (3.5) is only a biased and scaled 
logistic function (3.4). 


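3.3 Verify that the following functions can be used as sigmoidal functions: 

(a) $\phi(u) = \begin{cases} 1, & u > a \\ \frac{u}{2a} + \frac{1}{2}, & -a \le u \le a \\ 0, & u < -a \end{cases}$. 

(b) $\phi(x) = \frac{2}{\pi} \arctan(\beta x)$. 

(c) $\phi(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(\beta x)$. 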


3.4 A Taylor-series approximation of the logistic function is given by [44] 

$$\phi(x) = \begin{cases} \frac{1}{b - x + 0.5x^2}, & x < 0 \\ 1 - \frac{1}{b + x + 0.5x^2}, & x \ge 0 \end{cases}$$

where b > 2 is a constant. When b = 2, $\phi(x)$ is a continuous function. Plot this 
function. 


3.5 Show that the single-layer perceptron is a linear classifier. The perceptron 
can be used to implement the binary logic functions AND, OR, and COMPLE- 
MENT, but not EXCLUSIVE OR (XOR). Show how it can or cannot implement 
these logic functions. 


3.6 Build perceptrons that construct logical NOT, NAND, and NOR of their 
inputs. 


3.7 The parity problem returns 1 if the number of inputs that are 1 is even, 
and 0 otherwise. 

(a) Try to use a perceptron to learn the parity problem of 3 inputs. 

(b) Show that the parity function of n > 2 binary inputs $x_1, x_2, \ldots, x_n$ cannot be 
simulated by a perceptron. 


3.8 Is it possible to train a perceptron using a perceptron algorithm in which 
the bias is left unchanged and only the other weights are modified? 

3.9 Generate 20 vectors in two dimensions, each belonging to one of two classes. 
Write a program to implement the perceptron learning algorithm. Plot the deci- 
sion boundary after each iteration, and investigate the behavior of the algorithm. 
The data set can be linearly separable or linearly inseparable. 


3.10 For a multilayer forward network, if all neurons operate in their linear 
regions, show that such a network can reduce to a single-layer feedforward net- 
work. 


3.11 The $\alpha$-LMS rule is given by 

$$\mathbf{w}_{k+1} = \mathbf{w}_k + \alpha (d_k - y_k) \frac{\mathbf{x}_k}{\|\mathbf{x}_k\|^2},$$

where $d_k \in \mathbb{R}$ is the desired output, $\mathbf{x}_k$ is the input vector, and $\alpha > 0$. 
(a) Show that the $\alpha$-LMS rule can be derived from an incremental gradient 
descent on 

$$J(\mathbf{w}) = \frac{1}{2} \sum_i \frac{(d_i - y_i)^2}{\|\mathbf{x}_i\|^2}.$$

(b) Show that the Widrow-Hoff rule is stable when $0 < \alpha < 2$, unstable when 
$\alpha > 2$, and is oscillatory when $\alpha = 2$. 


3.12 Given the two-class problem with class 1: (3,4), (3,1); class 2: (—2,—1), 
(—3, —4). 

(a) With $\mathbf{w}_0 = (1, 0, 0)$, find the separating weight vector. 

(b) Plot the decision surface. 

4 Multilayer perceptrons: architecture 
and error backpropagation 


4.1 Introduction 


MLPs are feedforward networks with one or more layers of units between the 
input and output layers. The output units represent a hyperplane in the space 
of the input patterns. The architecture of MLP is illustrated in Fig. 4.1. Assume 
that there are $M$ layers, each having $J_m$, $m = 1, \ldots, M$, nodes. The weights from 
the $(m-1)$th layer to the $m$th layer are denoted by $\mathbf{W}^{(m-1)}$; the bias, output and 
activation function of the $i$th neuron in the $m$th layer are, respectively, denoted 
as $\theta_i^{(m)}$, $o_i^{(m)}$ and $\phi_i^{(m)}(\cdot)$. An MLP trained with the BP algorithm is also called 
a BP network. MLP can be used for classification of linearly inseparable patterns 
and for function approximation. 

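From Fig. 4.1, we have the following relations. Notice that a plus sign precedes 
the bias vector for easy presentation. For $m = 2, \ldots, M$ and the $p$th example: 

$$\hat{\mathbf{y}}_p = \mathbf{o}_p^{(M)}, \quad \mathbf{o}_p^{(1)} = \mathbf{x}_p, \tag{4.1}$$

$$\mathbf{net}_p^{(m)} = \left[\mathbf{W}^{(m-1)}\right]^T \mathbf{o}_p^{(m-1)} + \boldsymbol{\theta}^{(m)}, \tag{4.2}$$

$$\mathbf{o}_p^{(m)} = \boldsymbol{\phi}^{(m)}\left(\mathbf{net}_p^{(m)}\right), \tag{4.3}$$

where $\mathbf{net}_p^{(m)} = \left(net_{p,1}^{(m)}, \ldots, net_{p,J_m}^{(m)}\right)^T$, $\mathbf{W}^{(m-1)}$ is a $J_{m-1}$-by-$J_m$ matrix, 
$\mathbf{o}_p^{(m)} = \left(o_{p,1}^{(m)}, \ldots, o_{p,J_m}^{(m)}\right)^T$, $\boldsymbol{\theta}^{(m)} = \left(\theta_1^{(m)}, \ldots, \theta_{J_m}^{(m)}\right)^T$ is the bias vector, 
and $\boldsymbol{\phi}^{(m)}(\cdot)$ applies $\phi_i^{(m)}(\cdot)$ to the $i$th component of the vector within. 

Figure 4.1 Architecture of MLP. 

All $\phi_i^{(m)}(\cdot)$ are typically selected to be the same sigmoidal function; one can 
also select all $\phi_i^{(m)}(\cdot)$ in the first $M - 1$ layers as the same sigmoidal function, 
and all $\phi_i^{(M)}(\cdot)$ in the $M$th layer as another continuous yet differentiable function. 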
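
The layer-by-layer relations above amount to a short loop. The following sketch assumes that each weight matrix is stored as a list of rows, one row per neuron of the preceding layer, so that it is indexed as `W[u][v]`; the names `logistic` and `forward` and the default activation are choices made here for illustration.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, weights, biases, phi=logistic):
    """Forward pass through an MLP.

    weights[k] is the J_{m-1}-by-J_m matrix feeding layer m = k + 2,
    biases[k] is the bias vector of that layer (added, not subtracted).
    Returns the list of layer outputs, starting with the input itself.
    """
    o = list(x)                  # output of the first (input) layer
    outputs = [o]
    for W, theta in zip(weights, biases):
        # net input: transposed weight matrix times previous output, plus bias
        net = [sum(W[u][v] * o[u] for u in range(len(o))) + theta[v]
               for v in range(len(theta))]
        o = [phi(n) for n in net]            # componentwise activation
        outputs.append(o)
    return outputs
```

The last entry of the returned list is the network output; keeping the intermediate outputs is convenient because a backward pass needs them.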


Universal approximation 


MLP is a universal approximator. Its universal approximation capability stems 
from the nonlinearities used in the nodes. The universal approximation capability 
of four-layer MLPs has been addressed in [115], [45]. It has been mathematically 
proved that a three-layer MLP using sigmoidal activation function can approx- 
imate any continuous multivariate function to any accuracy [22, 44, 34, 137]. 
Usually, a four-layer network can approximate the target with fewer connection 
weights, but this may, however, introduce extra local minima [16, 115, 137]. 
Xiang et al. provided a geometrical interpretation of MLP on the basis of the 
special geometrical shape of the activation function. For the target function with 
a flat surface located in the domain, a small four-layer MLP can generate better 
results [137]. 

MLP is very efficient for function approximation in high-dimensional spaces. 
The error convergence rate of MLP is independent of the input dimensionality, 
while conventional linear regression methods suffer from the curse of dimension- 
ality, which results in a decrease of the convergence rate with an increase of the 
input dimensionality [5]. The necessary number of MLP neurons for approximat- 
ing a target function depends only upon the basic geometrical shape of the target 
function, and not on the dimensionality of the input space. Based on a geometri- 
cal interpretation of MLP, the minimal number of line segments or hyperplanes 
that can construct the basic geometrical shape of the target function is suggested 
as the first trial for the number of hidden neurons of a three-layer MLP [137]. A 
similar result is given in [149], where the optimal network size can be selected 
according to the number of extrema and the number of hidden nodes should be 
selected as the number of extrema. 

Approximation of piecewise continuous functions using smooth activation 
functions requires many hidden nodes and many training iterations, but still 
does not yield very good results due to the Gibbs phenomenon. In [106], a neu- 
ral network structure is given for approximation of piecewise continuous func- 
tions. It consists of neurons having standard sigmoidal functions, plus some addi- 
tional neurons having a special class of nonsmooth activation functions termed 
jump approximation basis function. This structure can approximate any piece- 
wise continuous function with discontinuities at a finite number of known points. 

A constructive proof that a real, piecewise continuous function can be almost 
uniformly approximated by three-layer feedforward networks is given in [71]. The 
constructive procedure avoids the Gibbs phenomenon. 

The approximation of sufficiently smooth multivariable functions with an MLP 
is considered in [122]. For a given approximation order, explicit formulas for the 
necessary number of hidden units and its distributions to the hidden layers of 
MLP are derived. It turns out that more than two hidden layers are not needed 
for minimizing the number of necessary hidden units. Depending on the number 
of inputs and the desired approximation order, one or two hidden layers should 
be used. For high approximation orders (> 12), two hidden layers should be used 
instead of one hidden layer. The same is true for smaller approximation orders 
and a sufficiently high number of inputs, as long as the approximation order is 
at least three. A sufficient condition is given for the activation function for which 
a high approximation order implies a high approximation accuracy. 

The $\Sigma$-$\Pi$ network [103] is a generalization of MLP. Unlike MLP, it uses product 
units as well as summation units to build higher-order terms. The BP learning 
rule can be applied to the learning of the network. The $\Sigma$-$\Pi$ network is known to 
provide inherently more powerful mapping capabilities than first-order models 
such as MLP. It is a universal approximator [44]. However, it has a combinatorial 
increase in the number of product terms and weights. 


Backpropagation learning algorithm 


BP learning is the most popular learning rule for performing supervised learning 
tasks [103, 132]. It is not only used to train feedforward networks such as MLP, 
but also is adapted to RNNs. The BP algorithm is a generalization of the delta 
rule, known as the LMS algorithm. Thus, it is also called the generalized delta rule. 
It uses a gradient-search technique to minimize a cost function equivalent to the 
MSE between the desired and actual network outputs. Due to the BP algorithm, 
MLP can be extended to many layers. 

The BP algorithm propagates backward the error between the desired signal 
and the network output through the network. After providing an input pattern, 
the output of the network is then compared with a given target pattern and the 
error of each output unit calculated. This error signal is propagated backward, 
and a closed-loop control system is thus established. The weights can be adjusted 
by a gradient-descent based algorithm. 

In order to implement the BP algorithm, a continuous, nonlinear, monotoni- 
cally increasing, differentiable activation function is needed. The logistic function 
and the hyperbolic tangent function are usually used. In the following, we derive 
the BP algorithm for MLP. BP algorithms for other neural network models can 
be derived in a similar manner. 

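The objective function for optimization is defined as the MSE between the 
actual network output $\hat{\mathbf{y}}_p$ and the desired output $\mathbf{y}_p$ for all the training pattern 
pairs $(\mathbf{x}_p, \mathbf{y}_p) \in \mathcal{S}$ 

$$E = \frac{1}{N} \sum_{p \in \mathcal{S}} E_p = \frac{1}{2N} \sum_{p \in \mathcal{S}} \left\|\hat{\mathbf{y}}_p - \mathbf{y}_p\right\|^2, \tag{4.4}$$

where $N$ is the size of the sample set, and 

$$E_p = \frac{1}{2} \left\|\hat{\mathbf{y}}_p - \mathbf{y}_p\right\|^2 = \frac{1}{2} \mathbf{e}_p^T \mathbf{e}_p, \tag{4.5}$$

$$\mathbf{e}_p = \hat{\mathbf{y}}_p - \mathbf{y}_p, \tag{4.6}$$

where the $i$th element of $\mathbf{e}_p$ is $e_{p,i} = \hat{y}_{p,i} - y_{p,i}$. Notice that a factor $\frac{1}{2}$ is used in 
$E_p$ for the convenience of derivation. 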

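All the network parameters $\mathbf{W}^{(m-1)}$ and $\boldsymbol{\theta}^{(m)}$, $m = 2, \ldots, M$, can be combined 
and represented by the matrix $\mathbf{W} = [w_{ij}]$. The error function $E$ or $E_p$ can be 
minimized by applying the gradient-descent procedure. When minimizing $E_p$, 
we have 

$$\Delta_p \mathbf{W} = -\eta \frac{\partial E_p}{\partial \mathbf{W}}, \tag{4.7}$$

where $\eta$ is the learning rate or step size, provided that it is a sufficiently small 
positive number. Note that the gradient term $\frac{\partial E_p}{\partial \mathbf{W}}$ is a matrix whose $(i, j)$th 
entry is $\frac{\partial E_p}{\partial w_{ij}}$. 

Applying the chain rule, the derivative in (4.7) can be expressed as 

$$\frac{\partial E_p}{\partial w_{uv}^{(m)}} = \frac{\partial E_p}{\partial net_{p,v}^{(m+1)}} \frac{\partial net_{p,v}^{(m+1)}}{\partial w_{uv}^{(m)}}. \tag{4.8}$$

The second factor of (4.8) is derived from (4.2) 

$$\frac{\partial net_{p,v}^{(m+1)}}{\partial w_{uv}^{(m)}} = \frac{\partial}{\partial w_{uv}^{(m)}} \left( \sum_{\omega=1}^{J_m} w_{\omega v}^{(m)} o_{p,\omega}^{(m)} + \theta_v^{(m+1)} \right) = o_{p,u}^{(m)}. \tag{4.9}$$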


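The first factor of (4.8) can again be derived using the chain rule 

$$\frac{\partial E_p}{\partial net_{p,v}^{(m+1)}} = \frac{\partial E_p}{\partial o_{p,v}^{(m+1)}} \frac{\partial o_{p,v}^{(m+1)}}{\partial net_{p,v}^{(m+1)}} = \frac{\partial E_p}{\partial o_{p,v}^{(m+1)}} \dot{\phi}_v^{(m+1)}\left(net_{p,v}^{(m+1)}\right), \tag{4.10}$$

where (4.3) is used. To solve the first factor of (4.10), we need to consider two 
situations for the output units ($m = M - 1$) and for the hidden units ($m = 
1, \ldots, M - 2$): 

$$\frac{\partial E_p}{\partial o_{p,v}^{(m+1)}} = e_{p,v}, \quad m = M - 1, \tag{4.11}$$

$$\begin{aligned} \frac{\partial E_p}{\partial o_{p,v}^{(m+1)}} &= \sum_{\omega=1}^{J_{m+2}} \frac{\partial E_p}{\partial net_{p,\omega}^{(m+2)}} \frac{\partial net_{p,\omega}^{(m+2)}}{\partial o_{p,v}^{(m+1)}} \\ &= \sum_{\omega=1}^{J_{m+2}} \frac{\partial E_p}{\partial net_{p,\omega}^{(m+2)}} \frac{\partial}{\partial o_{p,v}^{(m+1)}} \left( \sum_{u=1}^{J_{m+1}} w_{u\omega}^{(m+1)} o_{p,u}^{(m+1)} + \theta_\omega^{(m+2)} \right) \\ &= \sum_{\omega=1}^{J_{m+2}} \frac{\partial E_p}{\partial net_{p,\omega}^{(m+2)}} w_{v\omega}^{(m+1)}, \quad m = 1, \ldots, M - 2. \end{aligned} \tag{4.12}$$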


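Define the delta function by

$$\delta_{p,v}^{(m)} = -\frac{\partial E_p}{\partial \mathrm{net}_{p,v}^{(m)}}, \quad m = 2, \ldots, M. \tag{4.13}$$

By substituting (4.8), (4.12), and (4.13) into (4.10), we finally obtain for the
output units ($m = M - 1$) and for the hidden units ($m = 1, \ldots, M - 2$):

$$\delta_{p,v}^{(M)} = -e_{p,v} \dot{\phi}_v^{(M)}\left( \mathrm{net}_{p,v}^{(M)} \right), \quad m = M - 1, \tag{4.14}$$

$$\delta_{p,v}^{(m+1)} = \dot{\phi}_v^{(m+1)}\left( \mathrm{net}_{p,v}^{(m+1)} \right) \sum_{\omega=1}^{J_{m+2}} \delta_{p,\omega}^{(m+2)} w_{v\omega}^{(m+1)}, \quad m = 1, \ldots, M - 2. \tag{4.15}$$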


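Equations (4.14) and (4.15) provide a recursive method to solve $\delta_{p,v}^{(m+1)}$ for the
whole network. Thus, $W$ can be adjusted by

$$\frac{\partial E_p}{\partial w_{uv}^{(m)}} = -\delta_{p,v}^{(m+1)} o_{p,u}^{(m)}. \tag{4.16}$$

For the activation functions, we have the following relations:

$$\dot{\phi}(\mathrm{net}) = \beta \phi(\mathrm{net}) \left[ 1 - \phi(\mathrm{net}) \right], \quad \text{for logistic function}, \tag{4.17}$$

$$\dot{\phi}(\mathrm{net}) = \beta \left[ 1 - \phi^2(\mathrm{net}) \right], \quad \text{for tanh function}. \tag{4.18}$$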
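The two derivative relations can be checked numerically. The sketch below assumes the logistic function has the form 1/(1 + exp(-beta*net)) and the tanh activation has the form tanh(beta*net); these forms and all function names are assumptions made for illustration, since the activation definitions are not restated in this section.

```python
import math

def logistic(net, beta=1.0):
    # Assumed form of the logistic activation with gain beta.
    return 1.0 / (1.0 + math.exp(-beta * net))

def logistic_deriv(net, beta=1.0):
    """Relation (4.17): beta * phi * (1 - phi)."""
    p = logistic(net, beta)
    return beta * p * (1.0 - p)

def tanh_act(net, beta=1.0):
    # Assumed form of the tanh activation with gain beta.
    return math.tanh(beta * net)

def tanh_deriv(net, beta=1.0):
    """Relation (4.18): beta * (1 - phi^2)."""
    p = tanh_act(net, beta)
    return beta * (1.0 - p * p)

def central_difference(f, net, h=1e-6):
    """Numerical derivative used only to cross-check the closed forms."""
    return (f(net + h) - f(net - h)) / (2.0 * h)
```

Both closed forms reuse the already computed activation value, which is why the backward pass of BP needs no extra evaluation of the exponential.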


The update for the biases can be in two ways. The biases in the $(m + 1)$th
layer $\theta^{(m+1)}$ can be expressed as the expansion of the weight $W^{(m)}$, that is,
$\theta^{(m+1)} = \left( w_{0,1}^{(m)}, \ldots, w_{0,J_{m+1}}^{(m)} \right)^T$. Accordingly, the output $o^{(m)}$ is expanded into
$o^{(m)} = \left( 1, o_1^{(m)}, \ldots, o_{J_m}^{(m)} \right)^T$. Another way is to use a gradient-descent method
with regard to $\theta^{(m)}$ by following the above procedure. Since the biases can be
treated as special weights, these are usually omitted in practical applications.
The BP algorithm is defined by (4.7), and is rewritten here

$$\Delta_p W(t) = -\eta \frac{\partial E_p}{\partial W(t)}. \tag{4.19}$$

The algorithm is convergent in the mean if $0 < \eta < \frac{2}{\lambda_{\max}}$, where $\lambda_{\max}$ is the
largest eigenvalue of the autocorrelation of the vector $x$, denoted by $R$ [135].

Algorithm 4.1 (BP for a three-layer MLP).

All units have the same activation function $\phi(\cdot)$, and all biases are
absorbed into weight matrices.

1. Initialize $W^{(1)}$ and $W^{(2)}$.
2. Calculate $E$ using (4.4).
3. For each epoch:
   - Calculate $E$ using (4.4).
   - If $E$ is less than a threshold $\epsilon$, return.
   - For each $x_p$, $p = 1, \ldots, N$:
     a. Forward pass
        i. Compute $\mathrm{net}_p^{(2)}$ by (4.2) and $o_p^{(2)}$ by (4.3).
        ii. Compute $\mathrm{net}_p^{(3)}$ by (4.2) and $\hat{y}_p = o_p^{(3)}$ by (4.3).
        iii. Compute $e_p$ by (4.6).
     b. Backward pass, for all neurons
        i. Compute $\delta_{p,v}^{(3)} = -e_{p,v} \dot{\phi}\left( \mathrm{net}_{p,v}^{(3)} \right)$.
        ii. Update $W^{(2)}$ by $\Delta w_{uv}^{(2)} = \eta \delta_{p,v}^{(3)} o_{p,u}^{(2)}$.
        iii. Compute $\delta_{p,v}^{(2)} = \left( \sum_{\omega=1}^{J_3} \delta_{p,\omega}^{(3)} w_{v\omega}^{(2)} \right) \dot{\phi}\left( \mathrm{net}_{p,v}^{(2)} \right)$.
        iv. Update $W^{(1)}$ by $\Delta w_{uv}^{(1)} = \eta \delta_{p,v}^{(2)} o_{p,u}^{(1)}$.
4. end
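The steps of Algorithm 4.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the logistic function with unit gain is assumed for every unit, biases are stored in row 0 of each weight matrix, and both delta vectors are computed before any weight is changed (a small departure from the listed step order, chosen so that the returned changes equal the exact negative gradient scaled by the learning rate). All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """Forward pass; row 0 of each weight matrix holds the biases."""
    o1 = [1.0] + list(x)                                   # input layer output with bias entry
    net2 = [sum(W1[u][v] * o1[u] for u in range(len(o1))) for v in range(len(W1[0]))]
    o2 = [1.0] + [sigmoid(n) for n in net2]                # hidden layer output with bias entry
    net3 = [sum(W2[u][v] * o2[u] for u in range(len(o2))) for v in range(len(W2[0]))]
    return o1, o2, [sigmoid(n) for n in net3]

def pattern_error(W1, W2, x, y):
    """E_p = 0.5 * ||y_hat - y||^2, as in (4.5)."""
    y_hat = forward(W1, W2, x)[2]
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y_hat, y))

def bp_update(W1, W2, x, y, eta):
    """Return the weight changes (dW1, dW2) for one training pattern."""
    o1, o2, y_hat = forward(W1, W2, x)
    e = [a - b for a, b in zip(y_hat, y)]
    # Output deltas, (4.14); the logistic derivative is o * (1 - o).
    d3 = [-e[v] * y_hat[v] * (1.0 - y_hat[v]) for v in range(len(y_hat))]
    # Hidden deltas, (4.15); index v + 1 skips the bias entry of o2 and W2.
    d2 = [o2[v + 1] * (1.0 - o2[v + 1])
          * sum(d3[w] * W2[v + 1][w] for w in range(len(d3)))
          for v in range(len(W1[0]))]
    dW2 = [[eta * d3[v] * o2[u] for v in range(len(d3))] for u in range(len(o2))]
    dW1 = [[eta * d2[v] * o1[u] for v in range(len(d2))] for u in range(len(o1))]
    return dW1, dW2

def train_online(W1, W2, patterns, eta, epochs):
    """Incremental BP: the weights change after every pattern."""
    for _ in range(epochs):
        for x, y in patterns:
            dW1, dW2 = bp_update(W1, W2, x, y, eta)
            for W, dW in ((W1, dW1), (W2, dW2)):
                for u in range(len(W)):
                    for v in range(len(W[0])):
                        W[u][v] += dW[u][v]

# A 2-3-1 network with small random initial weights (sizes chosen for illustration).
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(1)] for _ in range(4)]
```

Calling `train_online(W1, W2, patterns, eta, epochs)` with a list of `(x, y)` tuples runs the inner loop of step 3; the stopping test on E is left to the caller.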





When $\eta$ is too small, the possibility of getting stuck at a local minimum of the
error function is increased. In contrast, the possibility of falling into oscillatory
traps is high when $\eta$ is too large. By statistically preprocessing the input patterns,
namely, decorrelating the input patterns, the excessively large eigenvalues of $R$
can be avoided and thus, increasing $\eta$ can effectively speed up the convergence.
PCA preconditioning speeds up the BP in most cases, except when the pattern
set consists of sparse vectors. In practice, $\eta$ is usually chosen to be $0 < \eta < 1$
so that successive weight changes do not overshoot the minimum of the error
surface. The flowchart of the BP for a three-layer MLP is shown in Algorithm 4.1.
The BP algorithm can be improved by adding a momentum term [103]

$$\Delta_p W(t) = -\eta \frac{\partial E_p}{\partial W(t)} + \alpha \Delta W(t - 1), \tag{4.20}$$

where $\alpha$ is the momentum factor, usually $0 < \alpha < 1$. The typical value for $\alpha$ is 0.9.
This method is usually called the BP with momentum. The momentum term can
effectively magnify the descent in almost-flat steady downhill regions of the error
surface by $\frac{1}{1 - \alpha}$. In regions with high fluctuations (due to high learning rates), the
momentum has a stabilizing effect. The momentum term actually inserts second-
order information in the training process that performs like the CG method.

Figure 4.2 Descent in weight space. (a) for small learning rate; (b) for large learning rate; (c) for large
learning rate with momentum term added.


The momentum term effectively smoothes the oscillations and accelerates the 
convergence. The role of the momentum term is shown in Fig. 4.2. BP with 
momentum is analyzed and the conditions for convergence are given in [129]. 
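The two effects of the momentum term can be seen on a scalar weight. The sketch below applies the update (4.20) to a constant gradient (a steady downhill slope) and to a gradient that flips sign at every step (an oscillating region); the function name and the two gradient sequences are illustrative assumptions.

```python
def momentum_steps(gradients, eta, alpha):
    """Return the successive weight changes of (4.20) for scalar gradients."""
    delta = 0.0
    steps = []
    for g in gradients:
        delta = -eta * g + alpha * delta   # gradient term plus momentum term
        steps.append(delta)
    return steps

eta, alpha = 0.01, 0.9
flat = momentum_steps([1.0] * 200, eta, alpha)          # steady downhill slope
zigzag = momentum_steps([1.0, -1.0] * 100, eta, alpha)  # gradient flips sign each step
```

On the constant slope the step size settles at eta/(1 - alpha), ten times the plain BP step for alpha = 0.9. On the alternating gradient the step magnitude settles at eta/(1 + alpha), which is smaller than the plain BP step, so the oscillation is damped.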

In addition to the gradient and momentum terms, a third term, namely, a 
proportional term, can be added to the BP update equation. The algorithm 
can be applied for both batch and incremental learning. For each example, the 
learning rule can be written as [152]

$$\Delta_p W(t) = -\eta \frac{\partial E_p}{\partial W(t)} + \alpha \Delta_p W(t - 1) + \gamma E_p(W(t)) \mathbf{1}, \tag{4.21}$$

where the matrix $\mathbf{1}$ has the same size as $W$ but with all the entries being unity,
and $\gamma$ is a proportional factor. This three-term BP algorithm is analogous to the
common PID control algorithm used in feedback control. Three-term BP, having 
a complexity similar to the BP, significantly outperforms the BP in terms of the 
convergence speed and the ability to escape from local minima. It is more robust 
to the choice of the initial weights, especially when relatively high values for the 
learning parameters are selected. 
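A minimal sketch of the three-term update is given below, with the weight matrix stored as a list of rows. The function name and argument names are hypothetical; the proportional term adds the same scalar, the proportional factor times the current pattern error, to every entry.

```python
def three_term_update(grad, prev_delta, error, eta, alpha, gamma):
    """Elementwise three-term rule: -eta*grad + alpha*prev_delta + gamma*error."""
    return [[-eta * g + alpha * d + gamma * error
             for g, d in zip(grad_row, delta_row)]
            for grad_row, delta_row in zip(grad, prev_delta)]
```

Because the proportional term does not depend on the gradient, it keeps the weights moving even where the gradient is nearly zero, as long as the error itself is not small.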

The emotional BP modifies the BP with additional emotional weights that are 
updated using two additional emotional parameters: the anxiety coefficient and 
the confidence coefficient [54]. 

In the above, the optimization objective is $E_p$ and the weights are updated
after the presentation of each pattern. Thus, the learning is termed as incremental 
learning, online learning, or pattern learning. When optimizing the average error 
E, we get the batch learning algorithm, where weights are updated only after all 
the training patterns are presented. 

The essential storage requirement for the BP algorithm consists of all the $N_w$
weights of the network. The computational complexity per iteration of the BP
is around $N_w$ multiplications for the forward pass, around $2N_w$ multiplications
for the backward pass, and $N_w$ multiplications for multiplying the gradient with
$\eta$. Thus, four multiplications are required per iteration per weight [51].

Since BP is a gradient-descent technique, it is prone to local minima in the cost 
function. The performance can be improved and the occurrence of local minima 
can be reduced by allowing extra hidden units, lowering the gain term, and by 
training with different initial random weights. The process of presenting all the
examples in the pattern set, with each example being presented once, is called
an epoch. Neural networks are trained by presenting all the examples cyclically
by epoch, until the convergence criterion is reached. The training examples should
be presented to the network in a random order during each epoch. 


Incremental learning versus batch learning 


Incremental learning and batch learning are two methods for BP learning. For 
incremental learning, the training patterns are presented to the network sequen- 
tially. It is a stochastic optimization method. For each training example, the 
weights are updated by the gradient-descent method

$$\Delta_p W = -\eta_{\mathrm{inc}} \frac{\partial E_p}{\partial W}. \tag{4.22}$$

The learning algorithm has been proved to minimize the global error $E$ when
$\eta_{\mathrm{inc}}$ is sufficiently small [103].

In batch learning, the optimization objective is $E$, and the weight update is
performed at the end of an epoch [103]. It is a deterministic optimization method.
The weight incrementals for each example are accumulated over all the training
examples before the weights are actually adapted

$$\Delta w_{ij}^{(m)} = -\eta_{\mathrm{batch}} \sum_{p} \frac{\partial E_p}{\partial w_{ij}^{(m)}} = \sum_{p} \Delta_p w_{ij}^{(m)}. \tag{4.23}$$


For sufficiently small learning rates, incremental learning approaches batch learn- 
ing and the two methods produce the same results [32]. 
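The statement that the two modes agree for small learning rates can be illustrated on a single linear unit with squared error. The sketch below runs one epoch in each mode from the same starting weights and measures the largest difference between the resulting weights; the toy data, the linear model, and all names are assumptions made for illustration.

```python
def grad_p(w, x, y):
    """Gradient of E_p = 0.5 * (w.x - y)^2 for a linear unit."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def batch_epoch(w, data, eta):
    """Accumulate all per-example gradients at the same w, then update once."""
    total = [0.0] * len(w)
    for x, y in data:
        g = grad_p(w, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    return [wi - eta * t for wi, t in zip(w, total)]

def incremental_epoch(w, data, eta):
    """Update after every example, so later gradients see the changed w."""
    w = list(w)
    for x, y in data:
        g = grad_p(w, x, y)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def gap(eta, w0, data):
    """Largest weight difference between the two modes after one epoch."""
    wb = batch_epoch(w0, data, eta)
    wi = incremental_epoch(w0, data, eta)
    return max(abs(a - b) for a, b in zip(wb, wi))

data = [((1.0, 0.5), 1.0), ((0.2, -1.0), 0.0), ((-0.7, 0.3), -0.5), ((0.9, 0.9), 2.0)]
w0 = [0.1, -0.2]
```

The gap shrinks roughly with the square of the learning rate, because the two modes differ only through the weight changes made earlier in the same epoch.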

Incremental learning can be used when the complete training set is not avail- 
able, and it is especially effective when the training set is very large, which 
necessitates large additional storage in the case of batch learning. For small con- 
stant learning rates, the randomness introduced provides incremental learning 
with a quasiannealing property, and allows for a wider exploration of the search 
space, which often helps in escaping from local minima [20]. However, incremen- 
tal learning is hard to parallelize. 

Gradient-descent algorithms are only truly gradient descent when their learn- 
ing rates approach zero; thus, both the batch and incremental learning are using 
approximations of the true gradient as they move through the weight space. 
When $\eta_{\mathrm{batch}}$ is sufficiently small, batch learning follows incremental learning
quite closely.

Incremental learning tends to be orders of magnitude faster than batch learn- 
ing, and is at least as accurate as batch learning, especially for large training sets 
[136]. Online training is able to follow curves in the error surface throughout each 
cycle, which allows it to safely use a larger learning rate and thus converge with 
fewer iterations through the training data. For large training sets, batch learning
is often completely impractical due to the minuscule $\eta_{\mathrm{batch}}$ required. Incremental
training can safely use a larger $\eta_{\mathrm{inc}}$, and can thus train more quickly.

Example 4.1: As explained in [136], for a training set with 20,000 examples, if
$\eta$ is selected as 0.1 and the average gradient is of the order of $\pm 0.1$ for each
weight per example, then the total accumulated weight change for batch learn-
ing will be of the order of $\pm 0.1 \times 0.1 \times 20000 = \pm 200$. This change in weight
is unreasonably big and will result in wild oscillations across the weight space.

When using incremental learning, each weight change will be of the order of
$\pm 0.1 \times 0.1 = \pm 0.01$. Thus, for a converging batch learning with $\eta_{\mathrm{batch}}$, the cor-
responding incremental learning algorithm can take $\eta_{\mathrm{inc}} = N \eta_{\mathrm{batch}}$, where $N$ is
the size of the training set.

It is recommended in [136] that $\eta_{\mathrm{inc}} = \sqrt{N} \eta_{\mathrm{batch}}$. As long as $\eta$ is small enough
to avoid drastic overshooting of curves and local minima, there is a linear rela-
tionship between $\eta$ and the number of epochs required for learning.
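The arithmetic of Example 4.1 can be reproduced in a few lines. The sketch below uses the numbers quoted in the example; the two helper functions encode the scaling relations between the incremental and batch learning rates as read from the text, and their names are hypothetical.

```python
import math

N = 20000      # training set size in Example 4.1
eta = 0.1      # learning rate in Example 4.1
grad = 0.1     # order of magnitude of the gradient per weight per example

batch_change = eta * grad * N       # accumulated over one full epoch
incremental_change = eta * grad     # applied after a single example

def eta_inc_upper(eta_batch, n):
    """Incremental rate matched to a converging batch rate: n * eta_batch."""
    return n * eta_batch

def eta_inc_recommended(eta_batch, n):
    """Recommended incremental rate: sqrt(n) * eta_batch."""
    return math.sqrt(n) * eta_batch
```

The accumulated batch change is larger than a single incremental change by the factor N, which is why the batch learning rate has to shrink as the training set grows.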

Although incremental training has advantages over batch training with respect 
to the absolute value of the expected difference, it does not, in general, con- 
verge to the optimal weight with respect to the expected squared difference [41]. 
Almost-cyclic learning is a better alternative for batch mode learning than cyclic 
learning [41]. In [86], the convergence properties of the two schemes applied to
quadratic loss functions are analyzed and the rate of convergence for each scheme
is given.

Analysis shows that with any analytic sigmoidal function incremental BP 
training is always convergent under some mild conditions [139]. Incremental 
training converges to the optimal weight with respect to the expected squared 
difference, if the variance of the random per-instance gradient decays exponen-
tially with the number of epochs processed during training. With proper $\eta$ and
the decay rate of the variance, incremental training converges to the optimal
weight faster than batch training does. If the training set size is sufficiently large,
then with regard to the absolute value of the expected difference, batch training
converges faster to the globally optimal weight than incremental training does
if $\eta < 1.2785$. With respect to the expected squared difference, batch training
converges to the globally optimal weight as long as $\eta < 2$. The rate of conver-
gence with respect to the absolute value of the expected difference improves
monotonically as $\eta$ increases up to $N$ for incremental training, whereas batch
training fails to converge if $\eta > 2$. Based on the estimate of the minimum error,
a dynamic learning rate for incremental BP training of three-layer feedforward
networks [150] ensures that the error sequence converges to the global minimum
error.


Example 4.2: Approximate the function:

$$f(x_1, x_2) = 4 x_1 \sin(10 x_1) \cos(10 x_2) + x_1 x_2 e^{x_1 x_2} \cos(20 x_1 x_2),$$

where $0 < x_1 < 1$, $0 < x_2 < 1$.

Figure 4.3 Approximation of a function using MLP. The training algorithms are BP and BP with
momentum in batch mode.

We use a three-layer network with 30 hidden nodes to approximate the func-
tion. The training algorithm is BP in batch mode. The learning rate is selected as
0.004, and 441 data points are uniformly generated for training. We also implement
BP with momentum in batch mode, and the additional momentum constant is
selected as 0.9. Figure 4.3a plots the function. Figure 4.3b plots the MSE evo-
lution for 100,000 epochs. The approximation error for the BP case is shown in
Figure 4.3c, and the result for BP with momentum is even worse. The conver-
gence of BP with momentum is of the same order as that of BP in our simulation.

Figure 4.4 Approximation of a function using MLP. The training algorithms are BP and BP with
momentum in online mode.
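The training data for Example 4.2 can be generated along the following lines. This is a sketch under assumptions: the target is taken to have the form coded in `target`, and "uniformly generated" is read here as an evenly spaced 21 x 21 grid on the unit square including its boundary, which gives exactly 441 points. The text does not say whether a grid or random sampling was used, and the function names are hypothetical.

```python
import math

def target(x1, x2):
    """Assumed form of the function to be approximated in Example 4.2."""
    return (4.0 * x1 * math.sin(10.0 * x1) * math.cos(10.0 * x2)
            + x1 * x2 * math.exp(x1 * x2) * math.cos(20.0 * x1 * x2))

def grid(points_per_axis=21):
    """Evenly spaced samples on the unit square, as ((x1, x2), f) pairs."""
    step = 1.0 / (points_per_axis - 1)
    return [((i * step, j * step), target(i * step, j * step))
            for i in range(points_per_axis)
            for j in range(points_per_axis)]

training_set = grid()
```

The same `training_set` can be fed to either batch or online BP, so the two modes of Examples 4.2 and 4.3 differ only in when the weights are updated.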


Example 4.3: We retrain the function shown in Example 4.2. This time we imple- 
ment BP and BP with momentum in online mode. The simulation setting is the 
same as that in Example 4.2. The difference is that the learning rate is selected 
as 0.5, and the momentum constant is selected as 0.9. Figure 4.4a plots the MSE 
evolution for 100,000 epochs. For a random run, the approximation error for the 
BP case is shown in Fig. 4.4b, and the result for the BP with momentum is 
worse. In our experiment, we found the learning rate for online algorithms can 
be set very large. 


Example 4.4: In the iris data set, shown in Fig. 4.5, 150 patterns are classified 
into 3 classes. Each pattern has four numeric properties. We use a 4-4-1 MLP to 
learn this problem, with three discrete values representing different classes. The 
logistic sigmoidal function is selected for the hidden neurons and linear function 
is used for the output neurons. Two learning schemes are applied. Eighty per 
cent of the data set is used as training data, and the remaining 20% as testing 
data. We set the performance goal as 0.001, and the maximum number of epochs 
as 1000. We simulate and compare BP and BP with momentum. 

During generalization, if the network output for an input pattern is closest to
one of the attribute values, the pattern is identified as belonging to that class.
For batch BP and BP with momentum, $\eta$ and $\alpha$ are both randomly distributed
between 0.3 and 0.9. For online algorithms, $\eta$ is randomly distributed between
0.5 and 10.5, and $\alpha$ is randomly distributed between 0.3 and 0.9. Table 4.1 lists
the results based on an average of 50 random runs. The traces of the training
error are plotted in Fig. 4.6 for a random run. For classification, if the distance
between the neural network output and the desired output is greater than 0.5,
we count it as a classification error.

Figure 4.5 Plot of the iris dataset: (a) $x_1$ vs. $x_3$. (b) $x_2$ vs. $x_4$.

Table 4.1. Performance comparison of a 4-4-1 MLP trained with BP and BP with momentum.

Algorithm     Training MSE   Classification accuracy (%)   std      Mean training time (s)   std (s)
BP, batch     0.1060         92.60                         0.1053   9.7086                   0.2111
BPM, batch    0.0294         95.60                         0.0365   9.7615                   0.3349
BP, online    0.1112         86.33                         0.1161   11.1486                  0.2356
BPM, online   0.1001         86.67                         0.0735   11.2803                  0.3662

BPM: BP with momentum.

From the simulation, we can see that the performance of BP as well as that 
of BP with momentum is highly dependent on the learning parameters selected, 
which are difficult to find for practical problems. There is no clear evidence that 
BP with momentum is superior to BP or that the algorithms in online mode are 
superior to their counterparts in batch mode. 
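The decision rule used with the single-output network of Example 4.4 can be sketched as follows. The class codes 1, 2, and 3 are an assumption (the text only says that three discrete values represent the classes), and the function names are hypothetical.

```python
def classify(output, class_values=(1.0, 2.0, 3.0)):
    """Return the class value closest to the scalar network output."""
    return min(class_values, key=lambda c: abs(output - c))

def is_error(output, desired, tolerance=0.5):
    """An error is counted when the output is farther than 0.5 from the target."""
    return abs(output - desired) > tolerance

def accuracy(outputs, targets):
    """Classification accuracy in percent over paired outputs and targets."""
    wrong = sum(is_error(o, t) for o, t in zip(outputs, targets))
    return 100.0 * (len(outputs) - wrong) / len(outputs)
```

With class codes spaced one unit apart, the 0.5 threshold and the nearest-value rule agree for the interior class; only the spacing of the codes is assumed here.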


Activation functions for the output layer 


Usually, all neurons in MLP use the same sigmoidal activation function. This 
restricts the outputs of the network to the range of (0,1) or (—1,1). For classi- 
fication problems, this representation is suitable. However, for function approxi- 
mation problems the output may be far from the desired output, and the training 
algorithm is actually invalid. A common solution is to apply preprocessing and
postprocessing.

Figure 4.6 Iris classification trained using 4-4-1 MLP with BP and BP with momentum: the traces of
the training error for a random run.


The preprocessing and postprocessing procedures are not necessary if the acti-
vation function for the neurons in the output layer is selected as a linear function
$\phi(x) = x$ to increase the dynamical range of the network output. A three-layer
MLP of such networks with $J_2$ hidden units has a lower bound for the degree
of approximation [76]. By suitably selecting an analytic, strictly monotonic, sig-
moidal activation function, this lower bound is essentially attainable.

When the activation function of the output layer is selected as $\phi(x) = x$, the
network can thus be trained in two steps. With the linearity property of the
output units, there is the relation

$$\left[ W^{(M-1)} \right]^T o_p^{(M-1)} = \hat{y}_p, \tag{4.24}$$

where $W^{(M-1)}$, a $J_{M-1}$-by-$J_M$ matrix, can be optimized by the LS method such
as the SVD, RLS, or CG method [78]. The CG method converges to the exact
solution in $J_{M-1}$ or $J_M$ steps, whichever is larger. BP is then used to update the
remaining weights.
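The first of the two steps, fitting the linear output weights by least squares, can be sketched for a single output unit. For brevity the sketch solves the normal equations by Gaussian elimination instead of the SVD, RLS, or CG methods named above, so it should be read as an illustration of the idea and not of those solvers; all names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ls_output_weights(hidden_outputs, targets):
    """Weights w minimizing sum_p (w . o_p - y_p)^2 via the normal equations."""
    J = len(hidden_outputs[0])
    A = [[sum(o[i] * o[j] for o in hidden_outputs) for j in range(J)] for i in range(J)]
    b = [sum(o[i] * y for o, y in zip(hidden_outputs, targets)) for i in range(J)]
    return solve(A, b)
```

Each row of `hidden_outputs` is the last hidden layer's output for one pattern (with a leading 1 if a bias is wanted). The normal equations are adequate for small, well-conditioned problems; the SVD is the safer choice when the hidden outputs are nearly collinear.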

The generalized sigmoidal function is introduced in [104, 87] for neurons in 
the output layer of an MLP used for 1-of-n classification. This unit is some- 
times called the soft-max or Potts unit [104]. The generalized sigmoidal function 
introduces a behavior that resembles in some respects the behavior of WTA 
networks. The sum of the outputs of the neurons in the output layer is always 
equal to unity. The use of the generalized sigmoidal function introduces addi- 
tional flexibility into the MLP model. Since the response of each output neuron 
is tempered by the responses of all the output neurons, the competition actually 
fosters cooperation among the output neurons. A single-layer perceptron using
the generalized sigmoidal function can solve linearly inseparable classification
problems [87]. The output of the $i$th output neuron is defined by

$$o_i^{(M)} = \phi\left( \mathrm{net}_i^{(M)} \right) = \frac{e^{\mathrm{net}_i^{(M)}}}{\sum_{j=1}^{J_M} e^{\mathrm{net}_j^{(M)}}}, \tag{4.25}$$

where the summation in the denominator is over all the neurons in the output
layer. The derivative of the generalized sigmoidal function is $o_i^{(M)} \left( 1 - o_i^{(M)} \right)$,
which is identical to the derivative for the logistic sigmoidal function.
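The generalized sigmoidal (soft-max) unit and its derivative with respect to its own net input can be sketched as below. Subtracting the largest net input before exponentiating is a standard numerical safeguard that leaves the result unchanged; the function names are hypothetical.

```python
import math

def softmax(net):
    """Outputs of the generalized sigmoidal function for a list of net inputs."""
    m = max(net)                                # shift for numerical stability
    exps = [math.exp(n - m) for n in net]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_diag_derivative(net):
    """Derivative of output i with respect to net input i: o_i * (1 - o_i)."""
    return [o * (1.0 - o) for o in softmax(net)]
```

The outputs always sum to one, so they can be read as class membership scores for 1-of-n classification.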


Optimizing network structure 


Smaller networks that use fewer parameters usually have better generalization 
capability. When training an MLP, the optimal number of neurons in the hidden 
layers is unknown and is estimated usually by trial-and-error. Network pruning 
and network growing are used to determine the size of the hidden layers. 

Network-pruning strategy first selects a network with a large number of hidden 
units, then removes the redundant units during the learning process. Pruning 
approaches usually fall into two broad groups. In sensitivity-based methods, one 
estimates the sensitivity of the error function E to the removal of a weight or unit, 
and removes the least important element. In penalty-based methods, additional 
terms are added to the error function E so that the new objective function 
rewards the network for choosing efficient solutions. The BP algorithm derived 
from this objective function drives unnecessary weights to zero and removes 
them during training. The two groups overlap if the objective function includes 
sensitivity terms. 


Network pruning using sensitivity analysis 


Network pruning can be performed based on the relevance or sensitivity analysis 
of the error function E with respect to a weight w. The relevance or sensitivity 
measure is usually used to quantify the contribution that individual weights or 
nodes make in solving the network task. The less relevant weights or units can 
be removed. Mathematically, the normalized sensitivity is defined by 


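$$S_w^E = \lim_{\Delta w \to 0} \frac{\Delta E / E}{\Delta w / w} = \frac{\partial \ln E}{\partial \ln w} = \frac{w}{E} \frac{\partial E}{\partial w}. \tag{4.26}$$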
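The normalized sensitivity can be estimated numerically for any scalar error function of one weight. The sketch below uses a central difference for the derivative; the function name and the step size are illustrative assumptions.

```python
def normalized_sensitivity(E, w, h=1e-6):
    """Estimate (w / E(w)) * dE/dw using a central difference."""
    dE = (E(w + h) - E(w - h)) / (2.0 * h)
    return w * dE / E(w)
```

For a power-law error E(w) = c * w**k the normalized sensitivity equals k regardless of c and w, which shows that the measure is scale-free in both the error and the weight.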


In the skeletonization technique [85], the sensitivity of $E$ with respect to $w$ is
defined as $S_w^E = -w \frac{\partial E}{\partial w}$. This definition of sensitivity has been applied in [53].
In Karnin’s method [53], during the training process, the sensitivity for each 
connection is calculated by making use of the available terms. Upon completion 
of the training process, those connections that have low sensitivities are pruned, 
and no retraining procedure is necessary. This method has been further improved 
in [36] by devising some pruning rules to prevent an input being removed from
the network or a particular hidden layer being totally removed. A fast training
algorithm is also included to retrain the network after a weight is removed. 
In [98], Karnin’s method has been extended by introducing the local relative 
sensitivity index within each subgroup or layer of the network. This enables 
parallel pruning of weights that are relatively redundant in different layers of a 
feedforward network. 

Sensitivity analysis of large-dimensional overtrained networks is conducted
in order to assess the relative importance of each hidden unit on the network 
output by computing the contribution of each hidden unit to the network output. 
A sensitivity-based method utilizing retraining is described in [108]. The output 
of each hidden unit is monitored and analyzed for all the training set after the 
network converges. If the output of a hidden unit is approximately constant for 
all the training set, this unit actually functions as a bias to all the neurons it 
feeds, and hence can be removed. Similarly, if two hidden units produce the same 
or proportional outputs for all the training set, one of the units can be removed. 
Small weights are assumed to be irrelevant and are pruned. After some units are 
removed, the network is retrained. This technique leads to a prohibitively long 
training process for large networks. 
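The test for hidden units that behave like a bias can be sketched as a variance check over the recorded hidden outputs. The tolerance value and the function names are hypothetical; the text does not specify how "approximately constant" is measured.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def constant_units(hidden_outputs, tol=1e-4):
    """Indices of hidden units whose output barely varies over the training set.

    hidden_outputs[p][j] is the output of hidden unit j for training pattern p.
    """
    n_units = len(hidden_outputs[0])
    return [j for j in range(n_units)
            if variance([row[j] for row in hidden_outputs]) < tol]
```

A unit flagged this way can be removed after folding its mean output, multiplied by its outgoing weights, into the biases of the units it feeds.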

A sensitivity-based method that uses linear models for hidden units is devel- 
oped in [49]. If a hidden unit can be well approximated as a linear model of its 
net input, then it can be eliminated and replaced by adding biases in subsequent 
layers and by changing weights that bypass the unit. Thus, such units can be 
pruned. No retraining of the network is necessary. In [14], an effective hidden 
unit-pruning algorithm called linear-dependence pruning utilizing sets of linear 
equations is presented; it improves upon the linear models [49] and includes net- 
work retraining. Redundant hidden units are well modeled as linear combinations 
of the outputs of the other units. Hidden units are modeled as linear combina- 
tions of nonlinear units in the same layer and in the earlier layers. The hidden 
unit that is predicted to increase the training error the least when replaced by 
its model is identified, and the pruning algorithm replaces it with its model and 
retrains the weights connecting to the output layer by one iteration of training. 
A pruning procedure described in [11] iteratively removes hidden units and then 
adjusts the remaining weights in such a way as to preserve the overall network 
behavior. The pruning problem is formulated as solving a set of linear equations 
by a CG algorithm in the LS sense. 

In [52], orthogonal transforms such as SVD and QR with column pivoting 
(QR-cp) are used for pruning neural networks. QR-cp coupled with SVD is used 
for subset selection and elimination of the redundant set. Based on the trans- 
forms on the training set, one can select the optimal sizes of the input and 
hidden nodes. The reduced-size network is then reinitialized and retrained to 
the desired convergence. In [117], the significance of increasing the number of 
neurons in the hidden layer of a feedforward network is evaluated using SVD. 
A pruning/growing technique based on the singular values of a trained network 
is then used to estimate the necessary number of neurons in the hidden layer.

Practical measures of sensitivities to inputs are developed, and utilized towards
deletion of redundant inputs in [151]. When one or more dimensions of the input 
vectors have relatively small sensitivity in comparison to others, that dimension 
of the input vectors can be removed, and a smaller-size neural network can be 
successfully retrained in most cases. 

A two-phase approach for pruning both the input and hidden units of MLPs 
based on mutual information is proposed in [138]. All features of the input vectors 
are first ranked according to their relevance to target outputs through a forward 
strategy. The salient input units of an MLP are thus determined according to the 
ranking and their contributions to the network performance, and the irrelevant 
features of the input vectors can be identified and eliminated. The redundant 
hidden units are then removed from the trained MLP one after another according 
to a relevance measure. 


Optimal brain damage and optimal brain surgeon 
The optimal brain damage (OBD) [59] and optimal brain surgeon (OBS) [40] 
procedures are two network pruning methods based on the perturbation analysis 
of the second-order Taylor expansion of the error function. 

In the following, we use $\vec{w}$ to represent the vector generated by concatenating
all entries of $W$. When the training process converges, the gradient is close to
zero, and thus the increase in $E$ due to a change in $\vec{w}$ is given by

$$\Delta E \simeq \frac{1}{2} \Delta \vec{w}^T H \Delta \vec{w}, \tag{4.27}$$

where $H$ is the Hessian matrix, $H = \frac{\partial^2 E}{\partial \vec{w}^2}$.

Removing a weight $w_i$ amounts to equating this weight to zero. Thus, removing
a subset of weights, $S_{\mathrm{prune}}$, results in a change in $E$ by setting $\Delta w_i = w_i$, if
$i \in S_{\mathrm{prune}}$, otherwise $\Delta w_i = 0$. Based on the saliency (4.27), OBD is a special
case of OBS, where the Hessian $H$ is assumed to be a diagonal matrix; in this
case, each weight has a saliency

$$(\Delta E)_i \simeq \frac{1}{2} w_i^2 H_{ii}. \tag{4.28}$$

In the procedure, a weight with the smallest saliency is selected for deletion. The 
calculation of the Hessian H is fundamental to the OBS procedure. 
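The OBD selection step can be sketched directly from the diagonal saliency formula. The sketch assumes the diagonal Hessian entries are already available; how they are computed is outside its scope, and the function names are hypothetical.

```python
def obd_saliencies(weights, hessian_diag):
    """Saliency of each weight under a diagonal Hessian: 0.5 * w_i^2 * H_ii."""
    return [0.5 * w * w * h for w, h in zip(weights, hessian_diag)]

def prune_smallest(weights, hessian_diag):
    """Zero the weight with the smallest saliency; return its index and the new weights."""
    s = obd_saliencies(weights, hessian_diag)
    i = min(range(len(s)), key=s.__getitem__)
    pruned = list(weights)
    pruned[i] = 0.0
    return i, pruned
```

With weights `[0.1, 2.0, -0.5]` and diagonal entries `[10.0, 0.001, 1.0]`, the weight selected for deletion is the largest one, because the error surface is almost flat along it; pruning by magnitude alone would have removed the first weight instead.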

Optimal cell damage [19] extends OBD to remove irrelevant input and hidden 
units. The unit-OBS [113] improves OBS by removing one whole unit in each 
step. The unit-OBS can also conduct feature extraction on the input data by 
removing unimportant input units. As an intermediate between OBD and OBS, 
the principal components pruning [68] is based on a block-diagonal approxima- 
tion of the Hessian; it is based on PCA of the node activations of successive layers 
of trained feedforward networks for a validation set. The node activation correla- 
tion matrix at each layer is required, while the calculation of the full Hessian of 
the error function is avoided. This method prunes the least salient eigen-nodes, 
and network retraining is not necessary.

In the case of early stopping, OBD and OBS are not suitable since the network
is not in a local minimum and the first-order term in the Taylor-series expan- 
sion is not zero. Early brain damage [123] is an extension to OBD and OBS in 
connection with early stopping; furthermore, it allows the revival of the already 
pruned weights. 

A pruning procedure similar to OBD is constructed using the error covariance 
matrix P obtained during RLS training [67]. As P is obtained along with the 
RLS algorithm, pruning becomes much easier. The RLS-based pruning has a 
computational complexity of $O(N_w^3)$, which is much smaller than that of OBD,
namely, $O(N_w^2 N)$, while its performance is very close to that of OBD in terms of
the number of pruning weights and generalization ability. In addition, the RLS- 
based pruning is also suitable for the online situation. Another network pruning 
technique, based on the training results from the extended Kalman filtering 
(EKF) technique, is given in [114]. The method prunes a neural network based 
solely on the obtained error covariance matrix P and the state (weight) vector. 

The variance nullity pruning [26] is based on the sensitivity analysis of the 
output, rather than on that of the error function. If the gradient search and 
the MSE function are used, then OBD and the output sensitivity analysis are 
conceptually the same under the assumption that the Hessian $H$ is diagonal.
Parameter relevance is measured as the variance in sensitivity over the training 
set, and those hidden or input nodes that are irrelevant are removed. The pruned 
network is then retrained. 


Network pruning using regularization 
For the regularization technique, the optimization objective is defined as

$$E_T = E + \lambda_c E_c, \tag{4.29}$$

where $E$ is the error function, $E_c$ is a penalty for the complexity of the struc-
ture, and $\lambda_c > 0$ is a regularization parameter, which needs to be appropriately
determined for a particular problem. Extra local minima are introduced to the
optimization process by the penalty term.

In the weight-decay technique [43, 130, 47], $E_c$ is defined as a function of the
weights. In [43], $E_c$ is defined as the sum of the squares of all the weights

$$E_c = \sum_{i,j} w_{ij}^2. \tag{4.30}$$

As a result, the change of each weight is proportional to its value. In [47], $E_c$ is
defined as the sum of the absolute values of the weights

$$E_c = \sum_{i,j} \left| w_{ij} \right|. \tag{4.31}$$

Thus, all the weights are decaying at a constant step to zero.

The BP algorithm derived from $E_T$ using a weight-decay term is a structural
learning algorithm

$$\Delta w_{ij}^{(m)} = -\eta \frac{\partial E_T}{\partial w_{ij}^{(m)}} = \Delta w_{ij,\mathrm{BP}}^{(m)} - \epsilon \frac{\partial E_c}{\partial w_{ij}^{(m)}}, \tag{4.32}$$

where $\Delta w_{ij,\mathrm{BP}}^{(m)} = -\eta \frac{\partial E}{\partial w_{ij}^{(m)}}$ is the weight change corresponding to BP learning,
and $\epsilon = \eta \lambda_c$ is the decaying coefficient at each weight change. The amplitudes of
the weights decrease continuously towards zero, unless they are reinforced by the
BP rule. At the end of training, only the essential weights deviate significantly
from zero. By pruning the weights that are close to zero, a skeleton network
is obtained. This effectively increases generalization and reduces the danger of
overtraining as well. For example, in the modified BP with forgetting [47, 56],
the weight-decay term (4.31) is used, and $\frac{\partial E_c}{\partial w_{ij}^{(m)}} = \mathrm{sign}\left( w_{ij}^{(m)} \right)$, where $\mathrm{sign}(\cdot)$ is
the signum function. Neural networks trained by weight-decay algorithms are
not sensitive to the initial choice of the network.

The weight-decay technique given in [37] is an implementation of a robust
network that is insensitive to noise. It decays the weights towards zero by weak-
ening the small weights more rapidly. Because small weights can be used by
the network to code noisy patterns, this weight-decay mechanism is especially
important in the case of noisy data. The weight-decay technique converges as
fast as BP, if not faster, and shows some significant improvement over BP in
noisy situations [37].
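The weight-decay update for a single weight can be sketched for both penalties discussed above: the squared-weight penalty, whose slope is proportional to the weight, and the absolute-value penalty, whose slope is the sign of the weight. The function names and the `penalty` argument are hypothetical.

```python
def sign(w):
    """Signum function: -1, 0, or 1."""
    return (w > 0) - (w < 0)

def decay_update(w, bp_change, eps, penalty="l2"):
    """Weight change: the BP change minus eps times the slope of the penalty."""
    if penalty == "l2":
        slope = 2.0 * w            # derivative of w^2 (sum-of-squares penalty)
    elif penalty == "l1":
        slope = float(sign(w))     # derivative of |w| (absolute-value penalty)
    else:
        raise ValueError("penalty must be 'l2' or 'l1'")
    return bp_change - eps * slope
```

With a zero BP change, the squared penalty shrinks a weight by a fixed fraction of its value at each step, while the absolute-value penalty moves it towards zero by the fixed amount `eps`, which is what lets unreinforced weights reach values close to zero and be pruned.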

The conventional RLS algorithm is essentially a weight-decay algorithm [67], 
since its objective function is similar to that for the weight-decay technique using 
(4.30). The error covariance matrix P obtained during the RLS training possesses 
properties similar to the Hessian matrix H of the error function. The initial value 
of P, namely, P(0) can be used to control the generalization ability. 

The weight-smoothing regularization introduces the constraint of Jacobian 
profile smoothness during the learning step [2]. Other regularization methods 
include neural Jacobians like the input perturbation [9], or generalized regular 
network [97] that minimize the neural Jacobian amplitude to smooth the neural- 
network behavior. 

Bayesian regularization [72] can determine the optimal regularization parame- 
ters in an automated fashion. This eliminates the need to guess the optimum net- 
work size. In this framework, the weights and biases of the network are assumed 
to be random variables with specified distributions. The regularization param- 
eters are related to the unknown variances associated with these distributions. 
These parameters are then estimated using statistical techniques. For small data 
sets, Bayesian regularization provides better generalization performance than 
early stopping does, because Bayesian regularization does not require a
validation data set and thus uses all the data.

Figure 4.7 Pruning an MLP by regularization: (a) A 1-20-1 network introduces overfitting. (b) The
pruned network has better generalization.

Compressing the weights of a layer of an MLP is equivalent to compressing
the input of the layer [28]. Thus, some ideas from compressed sensing can be
transferred to the training of MLP.


Example 4.5: We generate 41 noisy data points from the function
$f(x) = \sin(2\pi x)e^{x}$. We train a 1-20-1 network to approximate the function,
and find that the learned function is overfitted, as shown in Fig. 4.7a.

We use $E_c$ defined in (4.30), but with $E_c$ taken as the mean value over all
the weights, and $\lambda_c = 1$. Bayesian regularization is implemented; it
provides a measure of how many network parameters (weights and biases) are being
effectively used by the network. The final trained network uses approximately
11 parameters out of the 61 total weights and biases. We can see from Fig. 4.7b
that the network response is very close to the underlying function (dotted line),
and, therefore, the network will generalize well to new inputs.


4.6.3 Network growing


Another approach to training MLP is the constructive approach, which starts 
with a small network and then gradually adds hidden units until a given per- 
formance is achieved. This helps us in finding a minimal network. Constructive 
algorithms are computationally more economical than pruning algorithms. The 
constructive approach also helps in escaping a local minimum by adding a new 
hidden unit. When the error E does not decrease or decreases too slowly, the 
network may be trapped in a local minimum, and a new hidden unit is added 
to change the shape of the error function and thus to escape from the local 
minimum. The weights of the newly added neurons can be set randomly.

Cascade-correlation learning is a well-known constructive learning approach.
It is an efficient technique both computationally and in terms of modeling
performance [31]. In the cascaded architecture, each newly recruited unit is
connected both to the input nodes and to every pre-existing unit. For each newly
added hidden unit k, all the weights connected to the previously trained units
are frozen, while all the weights connected to the newly added unit and the
output units are updated. Network construction is based on the one-by-one
training and addition of hidden units. Training starts with no hidden unit. If
the minimal network cannot solve the problem after a certain number of training
cycles, a set of candidate hidden units with random initial weights is generated,
from which an additional hidden unit is selected and added to the network. The
constructed network has direct connections between the input and output units.
Moreover, the depth or the propagation delay through the network is directly
proportional to the number of hidden units and can be excessive. Many ideas from
cascade-correlation learning are employed in the constructive algorithms [58], [66].

The dependence identification algorithm constructs and trains an MLP by 
transforming the training problem into a set of quadratic optimization problems, 
which are then solved by a succession of sets of linear equations [84]. It is a batch 
learning process. The method uses the concept of linear dependence to group 
patterns. The overall convergence speed is orders of magnitude faster than that 
of BP, although the resulting network is usually large [84]. The algorithm is a 
faster and more systematic method for developing initial network architectures 
than the trial-and-error or gradient-based pruning techniques. 

A constructive learning algorithm for MLP using an incremental learning pro- 
cedure has been proposed in [70], which may be useful for real-time learning. 
Training patterns are learned one by one. The algorithm starts with a single 
training pattern and a single hidden neuron. During training, when the algo- 
rithm gets stuck at a local minimum, the weight-scaling technique is applied to 
help the algorithm to escape from the local minimum. If the algorithm fails in 
escaping from a local minimum after several consecutive attempts, the network 
is allowed to grow by adding a hidden neuron. Initial weights for the newly added 
neuron are selected using an optimization procedure based on the QP and LP 
techniques. 

A constructive training algorithm for three-layer MLPs for classification prob- 
lems is given in [99]. The Ho-Kashyap algorithm is central to training both the 
hidden layer nodes and the output layer nodes. A pruning procedure that removes 
the least important hidden node, one at a time, can be included to increase the 
generalization ability of the method. When constructing a three-layer MLP using 
a quasi-Newton method [107], the quasi-Newton method is used to minimize the 
sequence of error functions associated with the growing network. 

Examples of early constructive methods for training feedforward networks with 
LTG neurons are the tower algorithm [35], the tiling algorithm [82] and the 
upstart algorithm [33]. These algorithms are based on the pocket algorithm [35], 
and are used for classification.

Figure 4.8 Sigmoidal functions and their derivatives, $\beta = 1$. (a) Hyperbolic tangent function and its
derivative. (b) Logistic function and its derivative.


4.7 Speeding up learning process


BP is a gradient-descent method and has a slow convergence speed. Numerous
measures have been reported in order to speed up the convergence of the BP
algorithm.

Preprocessing of a training pattern set relieves the curse of dimensionality, and
also improves the generalization ability of the network. Preprocessing is efficient
when the training set is very large. A feature selection method particularly suited
for feedforward networks has been proposed in [102]. An L1-norm saliency metric
describing the sensitivity of the outputs of the trained network with respect to
the jth input is used. In [93], a feature extraction method that exhibits some
similarity to PCA has been proposed, where the L2-norm is used in the saliency
metric.


4.7.1 Eliminating premature saturation


One major reason for slow convergence is the occurrence of premature saturation
of the output of the sigmoidal functions. This can be seen from Fig. 4.8, where
the sigmoidal functions and their derivatives are plotted. When the absolute
value of net is large, $\dot{\phi}(\mathrm{net})$ is so small that the weight change
approaches zero and learning takes an excessively long time. This is the
flat-spot problem.
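
The flat-spot effect can be checked numerically. The following minimal Python sketch evaluates the derivatives of the logistic and hyperbolic tangent functions at a few net inputs, with gain $\beta = 1$ as in Fig. 4.8. The function names and the sample values of net are illustrative choices, not taken from the text.

```python
import math


def logistic(net, beta=1.0):
    """Logistic function with gain beta."""
    return 1.0 / (1.0 + math.exp(-beta * net))


def logistic_deriv(net, beta=1.0):
    """Derivative of the logistic function with respect to net."""
    y = logistic(net, beta)
    return beta * y * (1.0 - y)


def tanh_deriv(net, beta=1.0):
    """Derivative of tanh(beta * net) with respect to net."""
    return beta * (1.0 - math.tanh(beta * net) ** 2)


# The derivatives peak at net = 0 and decay rapidly as |net| grows,
# so the weight change (proportional to the derivative) nearly vanishes.
for net in (0.0, 2.0, 5.0, 10.0):
    print(net, logistic_deriv(net), tanh_deriv(net))
```

At net = 10 both derivatives are already several orders of magnitude below their peak values, which is why a saturated unit learns so slowly.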

Once trapped at saturation, the outputs of saturated units preclude any
significant improvement in the training weights directly connected to the units.
The premature saturation leads to an increase in the number of training cycles
required to release the trapped weights. In order to combat premature saturation,
one can modify the slope of the sigmoidal function, or modify the error function
$E$ so that when the output unit is saturated, i.e. the slope of the sigmoidal
function approaches zero, the backpropagated error is finite.

In [62], premature saturation of network output units has been analyzed as 
a static phenomenon that occurs at the beginning of the training stage as a 
consequence of random initialization of weights. The probability of premature 
saturation at the first training cycle is derived as a function of the initial weights, 
the number of nodes at each layer, and the slope of the sigmoidal function. 

A dynamic mechanism for premature saturation is analyzed in [125]. The
momentum term is identified as playing a leading role in the occurrence of
premature saturation. The entire premature saturation process is partitioned into
three distinct stages, namely, the beginning of the premature saturation, the
saturation plateau, and the complete recovery from saturation. For the onset of
premature saturation to occur, a set of four necessary conditions must be
simultaneously satisfied and usually remain satisfied for a number of consecutive
iterations. A method for preventing premature saturation is to temporarily modify
the momentum factor $\alpha$ once the four conditions are satisfied at iteration
$t$. If more than one output unit satisfies the four conditions, $\alpha$ is
calculated for each of these units and the smallest $\alpha$ is used to update
$\Delta W(t+1)$. The original $\alpha$ is used again after the $(t+1)$th
iteration. The algorithm works like BP unless the four conditions are
satisfied simultaneously.

In [63], the BP update equation is revised by adding a term embodying the
degree of saturation to prevent premature saturation. This turns out to be
equivalent to adding a term embodying the degree of saturation to the energy
function. Similarly, in [88], the partial derivatives of the logistic activation
function are generalized to $[o_{p,i}(1 - o_{p,i})]^{1/\rho_0}$ with $\rho_0 > 1$
so that error signals are significantly enlarged when $o_{p,i}$ approaches
saturation. Other authors also avoid premature saturation by using modified
energy functions [90, 128].


A modified BP algorithm is derived in [1] based on a criterion with an
additional linear quadratic error term

$$E_p = \frac{1}{2}\left\|\hat{y}_p - y_p\right\|^2 + \frac{1}{2}\lambda_c\left\|\mathrm{net}^{(M)} - \phi^{-1}(y_p)\right\|^2, \qquad (4.33)$$

where $\phi^{-1}(\cdot)$, the inverse of $\phi(\cdot)$, applies to each component of
the vector within, and $\lambda_c$ is a small positive number, usually
$0 \leq \lambda_c \leq 1$. For each pattern, the modified BP is slightly more
complex than BP, while it always converges significantly faster than BP, in both
the number of training iterations and the computation time, for a suitably
selected $\lambda_c$, which can be decreased from one to zero during the learning
process.

The modified BP algorithm for the three-layer MLP [128] significantly
improves BP in both the accuracy and the convergence speed. Besides, $\eta$ can be
selected as a large value without the worry of saturation. A new term embodying
the saturation degree is added to the conventional criterion function to prevent
premature saturation [128]

$$E_p = \frac{1}{2}\left\|\hat{y}_p - y_p\right\|^2 + \frac{1}{2}\left\|\hat{y}_p - y_p\right\|^2 \sum_{j=1}^{J_H}\left(o_{p,j} - 0.5\right)^2, \qquad (4.34)$$

where $J_H$ is the number of hidden units, and 0.5 is the average value of a
sigmoidal activation function. $\sum_{j=1}^{J_H}(o_{p,j} - 0.5)^2$ is defined as
the saturation degree for all the hidden neurons for pattern $p$.

While many attempts have been made to change the shape of the flat regions 
or to avoid them during MLP learning, a constructive approach proposed in [105] 
exploits the flat regions to stably and successively find excellent solutions. 


Adapting learning parameters 


The performances of the BP and BP-with-momentum algorithms are highly
dependent upon a suitable selection for $\eta$ and $\alpha$. Some heuristics are
needed for optimally adjusting $\eta$ and $\alpha$ to speed up the convergence of
the algorithms. Learning parameters are typically adapted once for each epoch.


Globally adapted learning parameters 

All the weights in the network are typically updated using the global learning
parameters $\eta$ and $\alpha$. The optimal $\eta$ is the inverse of the largest
eigenvalue, $\lambda_{\max}$, of the Hessian matrix $H$ of the error function [61].
The online algorithm for estimating $\lambda_{\max}$ proposed in [61] does not even
require a calculation of the Hessian.

A simple and popular method for accelerating the learning is to use the
search-then-converge schedule, which starts with a large $\eta$ and gradually
decreases it as the learning proceeds. According to [103], the process of
adapting $\eta$ is similar to that in simulated annealing. The algorithm escapes
from a shallow local minimum in early training and converges into a deeper,
possibly global minimum. Typically, $\eta$ is selected as

$$\eta(t) = \frac{\eta_0}{1 + \frac{t}{T_s}}, \qquad (4.35)$$

where $T_s$ is the search time.
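
As an illustration, a commonly used form of the search-then-converge schedule is $\eta(t) = \eta_0 / (1 + t/T_s)$. The sketch below assumes this form; the function name and the numerical values of `eta0` and `t_search` are hypothetical choices made only for the example.

```python
def search_then_converge(eta0, t, t_search):
    """Learning rate at epoch t, assuming eta(t) = eta0 / (1 + t / t_search)."""
    return eta0 / (1.0 + t / t_search)


# For t much smaller than t_search the rate stays near eta0 (search phase);
# for t much larger than t_search it decays roughly like eta0 * t_search / t.
for t in (0, 10, 100, 1000, 10000):
    print(t, search_then_converge(eta0=0.5, t=t, t_search=100.0))
```

With these assumed values the rate is halved at $t = T_s$ and keeps shrinking afterwards, which is the converge phase of the schedule.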
The bold-driver technique is a heuristic for optimal network performance [126,
6], where at the $(t+1)$th epoch $\eta$ is updated by

$$\eta(t+1) = \begin{cases} \rho^{+}\eta(t), & \Delta E(t) < 0 \\ \rho^{-}\eta(t), & \Delta E(t) > 0 \end{cases}, \qquad (4.36)$$

where $\rho^{+}$ is chosen to be slightly larger than unity, typically 1.1,
$\rho^{-}$ is chosen significantly less than unity, typically 0.5, and
$\Delta E(t) = E(t) - E(t-1)$. If the error decreases ($\Delta E < 0$), the
training is approaching the minimum, and we can increase $\eta$ to speed up the
search process. If, however, the error increases ($\Delta E > 0$), the algorithm
must have overshot the minimum and thus $\eta$ is too large. In this case, the
weights are not updated. This process is repeated until a decrease in the error
is found. $\alpha$ can be selected as a fixed value. However, at each occurrence
of an error increase, the next weight update needs to be along the negative
gradient direction to speed up the convergence, and $\alpha$ is set to 0
temporarily. Variants of this heuristic method include those with a fixed
increment of $\eta$.
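
The bold-driver rule can be sketched for a single scalar weight as follows. This is a minimal illustration, not the implementation of [126, 6]: the quadratic test error, the starting point, and the function name are assumptions, the momentum term is omitted, and an unchanged error is treated like an increase.

```python
def bold_driver(loss, grad, w, eta=0.1, rho_up=1.1, rho_down=0.5, epochs=50):
    """Gradient descent on a scalar weight with the bold-driver rate rule."""
    e_prev = loss(w)
    for _ in range(epochs):
        w_trial = w - eta * grad(w)
        e_trial = loss(w_trial)
        if e_trial < e_prev:
            # Error decreased: accept the step and enlarge the learning rate.
            w, e_prev = w_trial, e_trial
            eta *= rho_up
        else:
            # Error did not decrease: reject the step and shrink the rate.
            eta *= rho_down
    return w, eta, e_prev


# Hypothetical test problem: E(w) = (w - 3)^2 with its minimum at w = 3.
w_final, eta_final, e_final = bold_driver(
    loss=lambda w: (w - 3.0) ** 2,
    grad=lambda w: 2.0 * (w - 3.0),
    w=0.0,
)
print(w_final, eta_final, e_final)
```

Because rejected steps leave the weight untouched, the error sequence never increases, while the rate keeps probing upwards until an overshoot forces it down again.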

The gradient-descent rule can be reformulated as

$$\vec{w}(t+1) = \vec{w}(t) - \eta(t)\nabla E\left(\vec{w}(t)\right), \qquad (4.37)$$

where $\vec{w}$ is a vector formed by concatenating all the columns of $W$. In
[91], the learning rate of batch BP is adapted according to the instantaneous
value of $E(t)$:

$$\eta(t) = \rho_0 \frac{\rho(E(t))}{\left\|\nabla E\left(\vec{w}(t)\right)\right\|^2}, \qquad (4.38)$$

where $\rho_0$ is a positive constant, $\rho(E)$ is a function of $E$, typically
$\rho(E) = E$, and

$$\nabla E\left(\vec{w}(t)\right) = \left.\frac{\partial E(\vec{w})}{\partial \vec{w}}\right|_t. \qquad (4.39)$$

This adaptation leads to fast convergence. However, $\eta(t)$ is a very large
number in the neighborhood of a local or global minimum, leading to jumpy
behavior of the weights. The method converges faster than the Quickprop
algorithm does [29].
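
The growth of the learning rate near a minimum can be seen on a toy problem. The sketch below assumes that the rate has the form $\eta = \rho_0\,\rho(E)/\|\nabla E\|^2$ with $\rho(E) = E$, and uses the hypothetical scalar error $E(w) = w^4$; the function name and all numerical values are illustrative.

```python
def adaptive_step(w, rho0=1.0):
    """One update with eta = rho0 * E / |grad E|^2 for the toy error E(w) = w**4."""
    error = w ** 4
    g = 4.0 * w ** 3
    eta = rho0 * error / (g * g)
    return w - eta * g, eta


w = 1.0
etas = []
for _ in range(10):
    w, eta = adaptive_step(w)
    etas.append(eta)
    print(w, eta)
```

For this error function the update reduces to $w \leftarrow (1 - \rho_0/4)\,w$, so the weight shrinks geometrically while $\eta = \rho_0/(16w^2)$ grows without bound as the minimum is approached.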

In [74], $\eta$ is updated according to the local approximation of the Lipschitz
constant $L(t)$ based on Armijo’s condition for line search. The algorithm is
robust against oscillations due to large $\eta$ and avoids the phenomenon of the
nearly constant $E$ value, by ensuring that $E$ is decreased with every weight
update. The algorithm results in an improvement in the performance when compared
with that of the BP, delta-bar-delta [48], and bold-driver [126] techniques.

The fuzzy inference system is also used to adapt the learning parameters for an
MLP with BP [17]. The fuzzy system incorporates Jacobs’ heuristics [48] about
the unknown learning parameters using fuzzy IF-THEN rules. The heuristics
are driven by the behavior of $E(t)$. The change in $E(t)$, denoted by
$\Delta E(t)$, is an approximation to the gradient of $E$, and the change in
$\Delta E(t)$ is an approximation to the second-order derivatives of $E$. Fuzzy
inference systems are constructed for adjusting $\eta$ and $\alpha$,
respectively. This fuzzy BP learning is much faster than BP, with a
significantly smaller MSE [17].


Locally adapted learning parameters 


Each weight $w_{ij}^{(m)}$ can have its own learning rate $\eta_{ij}^{(m)}(t)$ so that

$$\Delta w_{ij}^{(m)}(t) = -\eta_{ij}^{(m)}(t)\, g_{ij}^{(m)}(t), \qquad (4.40)$$

where the gradient

$$g_{ij}^{(m)}(t) = \nabla E\left(w_{ij}^{(m)}\right) = \left.\frac{\partial E}{\partial w_{ij}^{(m)}}\right|_t. \qquad (4.41)$$


There are many locally adaptive learning algorithms using weight-specific 
learning rates such as the heuristics proposed in [118, 109], SuperSAB [120], 
delta-bar-delta [48], Quickprop [29], equalized error BP [79], and a globally con- 
vergent strategy [75]. 

Due to the nature of the sigmoidal function, a large input may result in
saturation that will slow down the adaptation process. In [118], the learning
rates for all input weights to a neuron are selected to be inversely proportional
to the fan-in of the neuron, namely
Ko 


(m-1) 2 
net™ (t) 


Nij (t) (4.42) 
where $\kappa_0$ is a small positive number. This can maintain a balance among the
learning speed of units with different fan-in. The increase in the convergence
speed is theoretically justified by studying the eigenvalue distribution of $H$ [60].
The heuristic proposed in [109] and in SuperSAB [120] is to adapt $\eta_{ij}^{(m)}$ by

$$\eta_{ij}^{(m)}(t+1) = \begin{cases} \eta_0^{+}\,\eta_{ij}^{(m)}(t), & g_{ij}^{(m)}(t)\, g_{ij}^{(m)}(t-1) > 0 \\ \eta_0^{-}\,\eta_{ij}^{(m)}(t), & g_{ij}^{(m)}(t)\, g_{ij}^{(m)}(t-1) < 0 \end{cases}, \qquad (4.43)$$

where $\eta_0^{+} > 1$, $0 < \eta_0^{-} < 1$. In SuperSAB,
$\eta_0^{+} \approx 1/\eta_0^{-}$. Since $\eta_{ij}^{(m)}$ grows and decreases
exponentially, too many successive acceleration steps may generate too large or
too small $\eta_{ij}^{(m)}$ and thus slow down the learning process. To avoid
this, a momentum term is included in SuperSAB.

The delta-bar-delta algorithm [48] is similar to the heuristic (4.43), but
eliminates its problems by making linear acceleration and exponential
deceleration of the learning rates. Individual $\eta_{ij}^{(m)}(t)$ are updated
based on a local optimization method, and the change $\Delta\eta_{ij}^{(m)}(t)$ is
given by

$$\Delta\eta_{ij}^{(m)}(t) = \begin{cases} \kappa_0, & \bar{g}_{ij}^{(m)}(t-1)\, g_{ij}^{(m)}(t) > 0 \\ -\beta\,\eta_{ij}^{(m)}(t), & \bar{g}_{ij}^{(m)}(t-1)\, g_{ij}^{(m)}(t) < 0 \\ 0, & \text{otherwise} \end{cases}, \qquad (4.44)$$

where

$$\bar{g}_{ij}^{(m)}(t) = (1 - \varepsilon)\, g_{ij}^{(m)}(t) + \varepsilon\, \bar{g}_{ij}^{(m)}(t-1), \qquad (4.45)$$

and $\varepsilon$, $\kappa_0$, $\beta$ are positive constants specified by the
user. All $\eta_{ij}^{(m)}$ are initialized with small values. Basically,
$\bar{g}_{ij}^{(m)}(t)$ is an exponentially decaying trace of gradient values.
Inclusion of a momentum term sometimes causes the delta-bar-delta to diverge, and
as such an adaptively changing momentum has been introduced to improve the
delta-bar-delta [83]. However, the delta-bar-delta requires a careful
selection of the parameters.
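
The delta-bar-delta rule, with its linear increase and exponential decrease of the individual learning rates, can be sketched as below. This is a minimal illustration under stated assumptions rather than the algorithm of [48] in full: the two-weight quadratic error, the parameter values, and the function name are hypothetical, no momentum is used, and the adapted rate is applied within the same iteration for simplicity.

```python
def delta_bar_delta(grad, w, eta0=0.01, kappa0=0.01, beta=0.2, eps=0.7, steps=200):
    """Per-weight learning rates adapted by the sign of g_bar(t-1) * g(t)."""
    w = list(w)
    eta = [eta0] * len(w)
    g_bar = [0.0] * len(w)          # exponentially decaying gradient trace
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            s = g_bar[i] * g[i]
            if s > 0:
                eta[i] += kappa0            # linear acceleration
            elif s < 0:
                eta[i] -= beta * eta[i]     # exponential deceleration
            w[i] -= eta[i] * g[i]
            g_bar[i] = (1.0 - eps) * g[i] + eps * g_bar[i]
    return w, eta


# Hypothetical ill-conditioned error E(w) = 0.5 * (w0^2 + 10 * w1^2).
curv = [1.0, 10.0]
loss = lambda w: 0.5 * sum(c * wi * wi for c, wi in zip(curv, w))
grad = lambda w: [c * wi for c, wi in zip(curv, w)]

w_final, eta_final = delta_bar_delta(grad, [1.0, 1.0])
print(loss(w_final), eta_final)
```

Each weight ends up with a rate matched to its own curvature: the rate for the flat direction grows much larger than the rate for the steep direction.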

In the Quickprop method [29, 31], $\alpha_{ij}^{(m)}(t)$ are heuristically
adapted. Quickprop is given by

$$\Delta w_{ij}^{(m)}(t) = \begin{cases} \alpha_{ij}^{(m)}(t)\, \Delta w_{ij}^{(m)}(t-1), & \Delta w_{ij}^{(m)}(t-1) \neq 0 \\ -\eta_0\, g_{ij}^{(m)}(t), & \Delta w_{ij}^{(m)}(t-1) = 0 \end{cases}, \qquad (4.46)$$

where

$$\alpha_{ij}^{(m)}(t) = \min\left\{ \frac{g_{ij}^{(m)}(t)}{g_{ij}^{(m)}(t-1) - g_{ij}^{(m)}(t)},\ \alpha_{\max} \right\}. \qquad (4.47)$$

$\alpha_{\max}$ is typically 1.75 and $0.01 \leq \eta_0 \leq 0.6$; $\eta_0$ is only
used at the start or restart of the training. To avoid the flat-spot problems,
Quickprop can be improved by adding 0.1 to the derivative of the sigmoidal
function [29]. The use of the error gradient at two consecutive time steps is a
discrete approximation to second-order derivatives, and the method is actually a
quasi-Newton method that uses the so-called secant steps. $\alpha_{\max}$ is used
to avoid very large Quickprop updates. Quickprop typically performs very reliably
and converges very fast [95]. However, the simplification of the Hessian to a
diagonal matrix used in Quickprop has not been theoretically justified and
convergence problems may occur for certain tasks.
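
The secant step used by Quickprop can be illustrated for a single weight. The sketch below is a simplified reading of the update, not Fahlman's full algorithm: it assumes a plain gradient step $-\eta_0 g$ whenever the previous weight change is zero, guards against a zero denominator, and uses a hypothetical quadratic error and function name.

```python
def quickprop_1d(grad, w, eta0=0.1, alpha_max=1.75, steps=6):
    """Quickprop-style updates for one weight, using secant steps."""
    dw_prev, g_prev = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        denom = g_prev - g
        if dw_prev != 0.0 and denom != 0.0:
            # Secant step, limited by the maximum growth factor alpha_max.
            alpha = min(g / denom, alpha_max)
            dw = alpha * dw_prev
        else:
            # Start or restart: plain gradient-descent step.
            dw = -eta0 * g
        w += dw
        dw_prev, g_prev = dw, g
    return w


# Hypothetical error E(w) = (w - 3)^2, so that grad E = 2 * (w - 3).
w_final = quickprop_1d(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

On a quadratic error an unclipped secant step lands on the minimum; here the first secant step is clipped by `alpha_max`, and the following one reaches the minimum up to rounding.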

In [146, 145], $\eta_{ij}^{(m)}$ and $\alpha_{ij}^{(m)}$ are optimally tuned using
three methods, namely, the second-order-based, first-order-based, and CG-based
methods. These methods make use of the derivatives of $E$ with respect to
$\eta_{ij}^{(m)}$ and $\alpha_{ij}^{(m)}$, and the information gathered from the
forward and backward procedures, but do not need explicit computation of the
first- and second-order derivatives in the weight space. The computational and
storage burdens are at most triple those of BP, with an order of magnitude
faster speed.

A general theoretical result has been derived for developing first-order batch
learning algorithms with local learning rates based on Wolfe’s conditions for line
search and the Lipschitz condition [75]. This result provides conditions under
which global convergence is guaranteed. This globally convergent strategy can be 
equipped with algorithms of this class to adapt the overall search direction to a 
descent one at each training iteration. When Quickprop [29] and the algorithms 
given in [109] are equipped with this strategy, they exhibit a significantly better 
percentage of success in reaching local minima than their original versions. 


Initializing weights 


The initial weights of a network play a significant role in the convergence of
a training method. Poor initial weight values may result in slow convergence or
leave the network stuck at a local minimum. The objective of weight initialization
is to find weights that are as close as possible to a global minimum before
training, and to increase the convergence speed. By weight initialization, the
outputs of the hidden neurons can be assigned in the nonsaturation region. Without
a priori knowledge of the final weights, it is common practice to initialize all
the weights with random small absolute values, or with small zero-mean random
numbers [103]. Randomness also helps to break the symmetry of the system,
which gradient-based learning algorithms are unable to break, and thus prevents
redundancy in the network. Starting from large weights may prematurely saturate
the units and slow down the learning process. Theoretically, the probability
of prematurely saturated neurons in MLP increases with the maximal value of
the weights [62]. By statistical analysis, the maximum amplitude for the initial
weights is derived in [24]. For the three-layer MLP, a weight range of
$[-0.77, 0.77]$ empirically gives the best mean performance over many existing
random weight initialization techniques [119].

There are many heuristics for weight initialization. In [131], the initial weights
of the $i$th unit at the $j$th layer are selected to be of the order of
$1/\sqrt{n_i^{(j)}}$, where $n_i^{(j)}$ is the number of weights to the $i$th unit
at the $j$th layer. When the weights to a unit are uniformly distributed in
$\left[-\sqrt{3/n_i^{(j)}},\ \sqrt{3/n_i^{(j)}}\right]$, the total input to that
unit, $\mathrm{net}_i^{(j)}$, is a random variable with zero mean and a standard
deviation of unity. This is an empirical optimal initialization of the weights
[119]. In [81],
the weights are first randomly initialized to the range $[-a_0, a_0]$, $a_0 > 0$,
and are then individually scaled to ensure that each neuron is active over its
full dynamic range. The scaling factor for the weights connected to the $i$th
neuron at the $j$th layer depends on $D_i^{(j)}$, the dynamic range of the
activation function. The optimal magnitudes of the initial weights and biases can
be determined based on multidimensional geometry [142]. This method ensures 
that the outputs of the hidden and output layers are well within the active 
region, while the dynamic range of the activation function is fully utilized. The 
hidden-layer weights can be initialized in such a way that each hidden node is 
assigned to approximate a portion of the range of the desired function based on 
a piecewise-linear approximation of a sigmoidal function at the start of network 
training [89]. 
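
The unit-variance property of fan-in-scaled uniform initialization can be checked by simulation. The sketch below assumes weights drawn uniformly from $[-\sqrt{3/n}, \sqrt{3/n}]$ for a unit with $n$ incoming weights, and inputs with zero mean and unit variance (a hypothetical choice made for the check); the function name, fan-in, and sample size are illustrative.

```python
import math
import random


def init_weights(fan_in, rng):
    """Draw fan_in weights uniformly from [-sqrt(3/fan_in), sqrt(3/fan_in)]."""
    a = math.sqrt(3.0 / fan_in)
    return [rng.uniform(-a, a) for _ in range(fan_in)]


rng = random.Random(0)
fan_in = 50
nets = []
for _ in range(5000):
    w = init_weights(fan_in, rng)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]   # assumed unit-variance inputs
    nets.append(sum(wi * xi for wi, xi in zip(w, x)))

mean = sum(nets) / len(nets)
std = math.sqrt(sum((v - mean) ** 2 for v in nets) / len(nets))
print(mean, std)
```

Each weight then has variance $1/n$, so the sum over $n$ unit-variance inputs has a variance close to one, which keeps the net input inside the nonsaturated region of the sigmoid.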

In addition to heuristics, there are many methods for weight initialization using 
parametric estimation. The sensitivity of BP to the initial weights is discovered 
to be a complex fractal-like structure for convergence as a function of the initial 
weights [55]. There are various weight-estimation techniques, where a nonlinear 
mapping between pattern and target is introduced [23, 24, 64]. 

Clustering is useful for weight initialization of three-layer MLPs. A three- 
layer MLP with prototypes is initialized in [23], based on supervised clustering. 
In [111], the clustering and nearest-neighbor methods are utilized to initialize 
hidden-layer weights, and the output-layer weights are then initialized by solving
a set of linear equations using SVD. In the initialization method given in
[133], the clustering and nearest-neighbor classification techniques are used for
a number of cluster sets, each representing the training examples with a dif- 
ferent degree of accuracy. The orthogonal least squares (OLS) method is used 
as a practical weight initialization algorithm for MLP in [64]. The maximum 
covariance initialization method [65] uses a procedure similar to that of the 
cascade-correlation algorithm [31]. An optimal weight initialization algorithm 
for the three-layer MLP [143] initializes the hidden-layer weights that extract 
the salient feature components from the input data based on ICA, whereas the 
initial output-layer weights are evaluated to keep the output neurons inside the 
active region. 

The optimal initial weights can be evaluated using the LS and linear alge- 
braic method [141, 140]. In [141], the optimal initial weights between layers are 
evaluated using the LS method by assigning the outputs of hidden neurons with 
random numbers in the range between 0.1 and 0.9. The actual outputs of the 
hidden neurons are obtained by propagating the input patterns through the net- 
work. The optimal weights between the hidden and output layers can then be 
evaluated by using the LS method. In [140], the weights connected to the hidden
layers are determined using Cauchy’s inequality, and the weights connected to
the output layer are determined by the LS method.

MLP learning is to estimate a nonlinear mapping between the input and the
output of the examples, $\Phi$, by superposition of the sigmoidal functions. By
using a Taylor-series development of $\Phi$ and the nonlinearity of the sigmoidal
function, two weight initialization strategies for the three-layer MLP are
obtained based on the first- and second-order identification of $\Phi$ [21]. These
techniques effectively avoid local minima, significantly speed up the convergence,
obtain a better generalization, and estimate the size of the network.


Adapting activation function 


During training, if a unit has a large net input, net, the output of this unit is
close to a saturation region of its sigmoidal function. Thus, if the target value
is substantially different from the saturated output, the unit has entered a
flat spot. Since the first-order derivative of the sigmoidal function,
$\dot{\phi}(\mathrm{net})$, is very small when net is large in magnitude, the
weight update is very slow. Fahlman developed a simple solution by adding a bias,
typically 0.1, to $\dot{\phi}(\mathrm{net})$ [29]. Hinton [42] suggested the design
of an error function that goes to infinity at points where
$\dot{\phi}(\mathrm{net}) \to 0$. This leads to a finite nonzero error update.

One way to solve the flat-spot problem is to define an activation function such
that [144]

$$\phi_{\mu}(\mathrm{net}) = \mu\,\mathrm{net} + (1 - \mu)\,\phi(\mathrm{net}), \quad \mu \in [0, 1]. \qquad (4.48)$$

At the beginning, $\mu = 1$ and all the nodes have linear activation, and BP is
used to obtain a local minimum in $E$. Then $\mu$ is decreased gradually and BP is
applied until $\mu = 0$. The flat-spot problem does not occur since
$\dot{\phi}_{\mu}(\mathrm{net}) > \mu$ and $\mu > 0$ for most of the training time.
When $\mu = 1$, $E(W, \mu)$ is a polynomial of $W$ and thus has few local minima.
This process can be viewed as an annealing process, which helps us in finding a
global or good minimum.
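
The lower bound on the slope of the blended activation can be verified directly. The sketch below takes $\phi$ to be the logistic function with unit gain (an assumption for the example) and implements the linear-sigmoidal mixture and its derivative; the function names are illustrative.

```python
import math


def phi(net):
    """Logistic function with unit gain (assumed choice of sigmoid)."""
    return 1.0 / (1.0 + math.exp(-net))


def phi_mu(net, mu):
    """Blend of a linear unit and a sigmoid: mu * net + (1 - mu) * phi(net)."""
    return mu * net + (1.0 - mu) * phi(net)


def dphi_mu(net, mu):
    """Derivative of phi_mu with respect to net."""
    y = phi(net)
    return mu + (1.0 - mu) * y * (1.0 - y)


# Even for a strongly saturated sigmoid the slope never drops below mu.
for mu in (1.0, 0.5, 0.1, 0.0):
    print(mu, dphi_mu(20.0, mu))
```

For $\mu = 0$ the slope at net = 20 is of the order of $10^{-9}$, while any positive $\mu$ keeps it at or above $\mu$, so the backpropagated error does not vanish.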

For sigmoidal functions, such as the logistic and hyperbolic tangent functions,
the gain $\beta$ represents the steepness (slope) of the activation function. In
BP, $\beta$ is fixed and typically $\beta = 1$. A modified BP with an adaptive
$\beta$ significantly increases the learning speed and improves the generalization
[57, 112]. Each neuron has its own variable gain $\beta_i^{(m)}$, which is adapted
by gradient descent

$$\Delta\beta_i^{(m)} = -\eta_{\beta}\frac{\partial E}{\partial \beta_i^{(m)}}, \qquad (4.49)$$

where $\eta_{\beta}$ is a small positive learning rate.

A large gain $\beta$ yields results similar to those with a high learning rate
$\eta$. Changing $\beta$ is equivalent to changing $\eta$, the weights, and the
biases. This is asserted by Theorem 4.1 [27].


Theorem 4.1 (Eom, Jung & Sirisena (2003) [27]). An MLP with the
logistic activation function $\phi(\cdot)$, gain $\beta$, learning rates $\eta$,
weights $W$, and biases $\theta$ is equivalent to a network of identical topology
with the activation function $\phi(\cdot)$, gain 1, learning rates $\beta^2\eta$,
weights $\beta W$ and biases $\beta\theta$, in the sense of BP learning.
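
The equivalence can be checked numerically for the simplest case of a single logistic neuron. The sketch below is only an illustration under stated assumptions: it uses the squared error $E = \frac{1}{2}(y - d)^2$, a net input of the form $wx + \theta$, and hypothetical values for the input, target, gain, and learning rate.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def bp_step(w, theta, x, d, gain, eta):
    """One gradient step for a single neuron y = logistic(gain * (w * x + theta))."""
    y = logistic(gain * (w * x + theta))
    delta = (y - d) * y * (1.0 - y) * gain   # dE/d(w*x + theta) for E = 0.5*(y - d)^2
    return w - eta * delta * x, theta - eta * delta, y


beta, eta = 2.5, 0.3
x, d = 0.8, 0.2
wa, ta = 0.4, -0.1                 # network A: gain beta, rate eta
wb, tb = beta * wa, beta * ta      # network B: gain 1, rate beta^2 * eta
for _ in range(5):
    wa, ta, ya = bp_step(wa, ta, x, d, gain=beta, eta=eta)
    wb, tb, yb = bp_step(wb, tb, x, d, gain=1.0, eta=beta ** 2 * eta)

print(wb, beta * wa, ya, yb)
```

After every step the weights of network B remain $\beta$ times those of network A and both networks produce the same output, which is the sense of equivalence stated in the theorem.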


A fuzzy system for automatically tuning the gain $\beta$ has been proposed in
[27] to improve the performance of BP. The inputs of the fuzzy system are the
sensitivities of the error with respect to the output and hidden layers, and the
output is the appropriate gain of the activation function.

An adaptation rule for the gain $\beta$ is derived using the gradient-descent
method based on a sigmoidal function such as [13]

$$\phi(x) = \left(\frac{1}{1 + e^{-x}}\right)^{\beta}, \qquad (4.50)$$

where $\beta \in (0, \infty)$. For $\beta \neq 1$, the derivative $\dot{\phi}(x)$
is skewed and its maximum shifts from the point corresponding to $x = 0$ for
$\beta = 1$, and the envelope of the derivatives is also sigmoidal. The method is
an order of magnitude faster than standard BP.

A sigmoidal activation function with a wide linear part is derived in [25] by
integrating an input distribution of the soft trapezoidal shape to generate a
probability of fulfillment

$$\phi(x) = \frac{1}{2b}\ln\frac{1 + e^{\beta(x - c) + b}}{1 + e^{\beta(x - c) - b}}, \qquad (4.51)$$

where $\beta$ is the gain parameter, $c$ decides the center of the shape, and
$b > 0$ decides the slope at $x = 0$. A larger $b$ leads to a smaller slope, while
a larger $\beta$ generates a longer linear part. When $b \to 0$, the log-exp
function approaches the logistic function $\phi(x) = \frac{1}{1 + e^{-\beta x}}$.
For the same $\beta$, the log-exp function always has a longer linear part and a
smaller slope at $x = 0$.

Figure 4.9 Illustration of the log-exp and logistic functions. For the log-exp function, $c = 0$.

An illustration of the log-exp function as well as the logistic function is shown
in Fig. 4.9. For the same $\beta$, the log-exp function always has a larger
nonsaturation zone and a wider linear part than the logistic function. For the
log-exp function, the slope at $x = 0$ is decided by $b$ and the width of the
linear part is determined by $\beta$. The logistic function can extend its
nonsaturation zone by decreasing $\beta$, but the width of its linear part is
still limited. The extended linear central part of the log-exp function prevents
premature saturation and thus speeds up the training of MLPs.

Consider an algorithm whose time to convergence is unknown, and the following
strategy. Run the algorithm for a specific time $T$. If it has not converged
within $T$, rerun it from the start. This restart mechanism [29] is advantageous
in problems that are prone to local minima or when there is a large variability
in convergence time from run to run, and may lead to a speed-up in such cases. It
can reduce the overall average and standard deviation of the training time. The
restart mechanism has also been applied in many optimization applications. It is
theoretically analyzed in [73], where conditions on the probability density of the
convergence time for which restart will improve the expected convergence time
are obtained and the optimal restart time is derived.


4.8 Some improved BP algorithms

BP can be accelerated by extrapolation of each individual weight [51]. This
extrapolation procedure is easy to implement and is activated only a few times
in between iterations of BP. It leads to significant savings in the computation
time of BP, and the solution is always located in close proximity to the one
obtained by the BP procedure. BP by weight extrapolation reduces the required
number of iterations at the expense of only a few extrapolation steps embedded in
BP and some additional storage for computed weights and gradient vectors.

As an alternative to BP, a gradient-descent method without error backprop- 
agation has been proposed for MLP learning [10]. Unlike BP, the method feeds 
gradients forward rather than feeding errors backwards. The gradients of the 
final output are determined by feeding the gradients of the intermediate outputs 
forward at the same time that the outputs of the intermediate layers are fed for- 
ward. This method turns out to be equivalent to BP for a three-layer MLP, but is 
much more readily extended to arbitrary number of layers without modification. 
This method has a great potential for concurrency. 

Lyapunov stability theory has been applied for weight update [147, 8, 77]. 
Lyapunov’s theory guarantees convergence under certain sufficient conditions. 
In [147], a generalization of BP is developed for training feedforward networks. 
The BP, Gauss-Newton, and Levenberg-Marquardt (LM) algorithms are special 
cases of this general algorithm. The general algorithm has the ability to handle 
time-varying inputs. The LF I and II algorithms [8], as two adaptive versions of 
BP, converge much faster than the BP and EKF algorithms to attain the same 
accuracy. In addition, sliding mode control-based adaptive learning algorithms 
have been used to train adalines [110] and multilayer networks [92] with good 
convergence and robustness. 

The successive approximative BP algorithm [69] can effectively avoid local
minima. Given a set of $N$ pattern pairs $\{(\boldsymbol{x}_p, \boldsymbol{y}_p)\}$, all the training
patterns are normalized so that $|x_{p,j}| < 1$, $|y_{p,k}| < 1$, for $p = 1, \ldots, N$,
$j = 1, \ldots, J_1$, $k = 1, \ldots, J_M$. The training is composed of $N_{\rm phase}$
successive BP training phases, each being terminated when a predefined accuracy
$\delta_i$, $i = 1, \ldots, N_{\rm phase}$, is achieved.
At the first phase, the network is trained using BP on the training set. After
accuracy $\delta_1$ is achieved, the outputs of the network for the $N$ inputs
$\{\boldsymbol{x}_p\}$ are $\{\hat{\boldsymbol{y}}_p(1)\}$ and the weights are $\boldsymbol{W}(1)$. Calculate the output errors
$d\boldsymbol{y}_p(1) = \boldsymbol{y}_p - \hat{\boldsymbol{y}}_p(1)$ and normalize each $d\boldsymbol{y}_p(1)$ so that $|dy_{p,k}(1)| < 1$.
In the second phase, the $N$ training patterns are $\{(\boldsymbol{x}_p, d\boldsymbol{y}_p(1))\}$. The
training terminates at accuracy $\delta_2$, with weights $\boldsymbol{W}(2)$ and outputs
$\{\hat{\boldsymbol{y}}_p(2)\}$. Calculate $d\boldsymbol{y}_p(2) = d\boldsymbol{y}_p(1) - \hat{\boldsymbol{y}}_p(2)$ and normalize
$d\boldsymbol{y}_p(2)$. This process continues up to phase $N_{\rm phase}$ with accuracy
$\delta_{N_{\rm phase}}$ and weights $\boldsymbol{W}(N_{\rm phase})$. The final training error is given by

$$E < 2^{N_{\rm phase}} \prod_{i=1}^{N_{\rm phase}} \delta_i. \tag{4.52}$$

If all $\delta_i < \frac{1}{2}$, then $E \to 0$ as $N_{\rm phase} \to \infty$. Successive approximative BP
empirically significantly outperforms BP in terms of convergence speed and
generalization performance.


BP with global descent


The gradient-descent method is a stochastic dynamical system whose stable 
points only locally minimize the energy (error) function. The global-descent 
method, which is based on a global optimization technique called terminal 
repeller unconstrained subenergy tunneling (TRUST) [12, 4], is a deterministic 
dynamic system consisting of a single vector differential equation. TRUST was 
introduced for general optimization problems, and it formulates optimization in 
terms of the flow of a special deterministic dynamical system. 

Global descent is a gradient-descent method using a special criterion function.
The derived update automatically switches between two phases: the tunneling
phase and the local-search phase. In the tunneling phase, the terminal repeller
term dominates and the local minimum becomes a repelling unstable equilibrium
point; the solution is repelled from the neighborhood of the local minimum
until it reaches a lower basin of attraction. In the local-search phase, the
repeller term is identically zero; this phase implements gradient descent and
finds a local minimum in a new region. The two phases alternate until a stopping
criterion is satisfied. The global-descent rule replaces the gradient-descent
rule for MLP learning [12, 101]. BP with tunneling for training MLP is similar
to global descent [12] and can find the global minimum from an arbitrary initial
choice in the weight space in polynomial time [101].

Another two-phase learning model alternates between a BP phase and a
gradient-ascent phase [116]. The BP phase performs steepest descent on an error
measure. When BP gets stuck at a local minimum, the gradient-ascent phase
attempts to fill up the valley by modifying gain parameters in a gradient-ascent
direction of the error measure. The two phases are repeated until the network
gets out of the local minimum.

Deterministic global-descent methods usually use a tracing strategy decided 
by trajectory functions. These can be hybrid global/local minimization methods 
[4] or based on the concept of the terminal attractor [148]. 

Nonlinear dynamic systems satisfying the Lipschitz condition have a unique 
solution for each initial condition, and the trajectory of the state approaches 
the solution asymptotically, but never reaches it. The concept of a terminal 
attractor was first introduced by Zak [148]. Terminal attractors are fixed points 
in a dynamic system violating the Lipschitz condition. As a result, a terminal 
attractor is a singular solution that envelopes the family of regular solutions, 
while each regular solution approaches such an attractor in finite time. The ter- 
minal attractor based BP algorithm [127, 50] applies the concept of the terminal 
attractor to enable finite-time convergence to the global minimum. In contrast
to BP, the learning rate $\eta$ in the terminal attractor-based BP is adapted by

$$\eta = \gamma \frac{h(E)}{\|\boldsymbol{g}\|^2}, \tag{4.53}$$

where $\gamma > 0$, $\boldsymbol{g} = \nabla E$, and $h(E)$ is a non-negative continuous function of $E$.
This leads to an error function that evolves by

$$\frac{dE}{dt} = -\gamma h(E). \tag{4.54}$$

When $h(E)$ is selected as $E^{\mu}$, with $0 < \mu < 1$, $E$ will stably reach zero in time
[50]

$$T = \frac{E(0)^{1-\mu}}{\gamma (1 - \mu)}. \tag{4.55}$$

According to (4.53), at local minima, $\eta \to \infty$, and the algorithm can escape from
local minima. By selecting $\gamma$ and $\mu$, one can tune the time to exactly reach $E = 0$.
Terminal attractor-based BP can be three orders of magnitude faster than BP
[127]. When $\|\boldsymbol{g}\|$ is sufficiently large so that $\eta < \gamma$, one can, as a heuristic, force
$\eta = \gamma$ temporarily to speed up the convergence, that is, switch to BP temporarily.
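
The finite convergence time can be checked by a short derivation. The sketch below assumes the continuous-time gradient flow $d\boldsymbol{w}/dt = -\eta \boldsymbol{g}$ and writes $E(0)$ for the error at $t = 0$; both are notational assumptions made here for illustration.

```latex
\frac{dE}{dt} = \boldsymbol{g}^T \frac{d\boldsymbol{w}}{dt}
             = -\eta \|\boldsymbol{g}\|^2
             = -\gamma h(E)
\quad\Longrightarrow\quad
\eta = \gamma \frac{h(E)}{\|\boldsymbol{g}\|^2},
\\[6pt]
h(E) = E^{\mu}:\qquad
\int_{E(0)}^{0} E^{-\mu}\, dE = -\gamma \int_{0}^{T} dt
\quad\Longrightarrow\quad
-\frac{E(0)^{1-\mu}}{1-\mu} = -\gamma T
\quad\Longrightarrow\quad
T = \frac{E(0)^{1-\mu}}{\gamma (1-\mu)} .
```

The integral on the left is finite only because $0 < \mu < 1$; for $\mu \ge 1$ the error would approach zero only asymptotically, which is the ordinary (Lipschitz) case.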


Robust BP algorithms 


Since BP is a special case of stochastic approximation, the techniques of robust 
statistics can be applied to BP [134]. In the presence of outliers, M-estimator 
based robust learning can be applied. The rate of convergence is improved since 
the influence of the outliers is suppressed. Robust BP algorithms using M- 
estimator-based criterion functions are a typical class of robust algorithms, such 
as the robust BP using Hampel’s tanh estimator with time-varying error cutoff
points $\beta_1$ and $\beta_2$ [15], and the annealing robust BP algorithm [18].

Annealing robust BP [18] adopts the annealing concept into robust learning. 
A deterministic annealing process is applied to the scale estimator. The cost 
function of annealing robust BP has the same form as (2.15), with $\beta = \beta(t)$
as a deterministic annealing scale estimator. As $\beta(t) \to \infty$, annealing robust BP
becomes BP. The basic idea of using an annealing schedule is to use a larger scale
estimator in the early training stage and then a smaller scale estimator
in the later training stage. When $\beta(t) \to 0^+$ for $t \to \infty$, the M-estimator is
equivalent to the linear $L_1$-norm estimator. Since the $L_1$-norm estimator is robust
against outliers, the M-estimator equipped with such an annealing schedule is
equivalent to the robust mixed-norm learning algorithm [20], where the $L_2$-norm
is used at the beginning and then gradually tends to the $L_1$-norm according to
the total error. An annealing schedule $\beta(t) = \gamma / t$ achieves good performance,
where $\gamma$ is a positive constant [18].

M-estimator based robust methods have difficulties in the selection of the scale
estimator $\beta$. The Tao-robust BP algorithm [94] overcomes this problem by using a
$\tau$-estimator. Tao-robust BP also achieves two important properties: robustness
with a high breakdown point and a high efficiency for normally distributed data.


Resilient propagation (RProp)


RProp [100] eliminates the influence of the magnitude of the partial derivative 
on the step size of the weight update. The update of each weight is according to 
the sequence of signs of the partial derivatives in each dimension of the weight 
space. 

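The update for each weight or bias $w_{ij}^{(m)}$ is given according to the following
procedure [100]

$$C = g_{ij}^{(m)}(t-1) \cdot g_{ij}^{(m)}(t), \tag{4.56}$$

$$\Delta_{ij}^{(m)}(t) = \begin{cases}
\min\left\{\eta^{+} \Delta_{ij}^{(m)}(t-1), \Delta_{\max}\right\}, & C > 0 \\
\max\left\{\eta^{-} \Delta_{ij}^{(m)}(t-1), \Delta_{\min}\right\}, & C < 0 \\
\Delta_{ij}^{(m)}(t-1), & C = 0
\end{cases}, \tag{4.57}$$

$$\Delta w_{ij}^{(m)}(t) = \begin{cases}
-{\rm sign}\left(g_{ij}^{(m)}(t)\right) \cdot \Delta_{ij}^{(m)}(t), & C \ge 0 \\
-\Delta w_{ij}^{(m)}(t-1), & C < 0
\end{cases}, \tag{4.58}$$

$$g_{ij}^{(m)}(t) = 0, \quad C < 0, \tag{4.59}$$

$$w_{ij}^{(m)}(t+1) = w_{ij}^{(m)}(t) + \Delta w_{ij}^{(m)}(t), \tag{4.60}$$

where $0 < \eta^{-} < 1 < \eta^{+}$, and typically $\eta^{+} = 1.2$ and $\eta^{-} = 0.5$. The value of
$\Delta_{ij}^{(m)}(0)$ is not critical to the algorithm, and is selected as a positive constant $\Delta_0$.
The upper and lower bounds, denoted by $\Delta_{\max}$ and $\Delta_{\min}$, respectively, are used
to avoid overflow/underflow problems of floating-point variables. For example,
one can select $\Delta_{\max} = 50.0$ and $\Delta_{\min} = 10^{-6}$ [100]. A smaller value of $\Delta_{\max}$
such as 1.0 may result in a smoothened behavior of the decrease in error.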
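
The procedure (4.56)-(4.60) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `rprop_step` and `minimize_rprop`, the initial step size 0.1, and the badly scaled quadratic test function are all assumptions introduced here.

```python
import numpy as np

def rprop_step(w, g, g_prev, delta, dw_prev,
               eta_plus=1.2, eta_minus=0.5, delta_max=50.0, delta_min=1e-6):
    """Apply (4.56)-(4.60) element-wise to a weight vector w."""
    c = g_prev * g                                             # (4.56)
    delta = np.where(c > 0, np.minimum(eta_plus * delta, delta_max),
                     np.where(c < 0, np.maximum(eta_minus * delta, delta_min),
                              delta))                          # (4.57)
    dw = np.where(c >= 0, -np.sign(g) * delta, -dw_prev)       # (4.58)
    g = np.where(c < 0, 0.0, g)                                # (4.59)
    return w + dw, g, delta, dw                                # (4.60)

def minimize_rprop(grad, w0, delta0=0.1, n_steps=200):
    """Run RProp for n_steps using only the signs of the gradient."""
    w = np.asarray(w0, dtype=float)
    g_prev = np.zeros_like(w)           # no previous gradient at t = 0
    delta = np.full_like(w, delta0)     # initial step size Delta_0
    dw_prev = np.zeros_like(w)
    for _ in range(n_steps):
        w, g_prev, delta, dw_prev = rprop_step(w, grad(w), g_prev, delta, dw_prev)
    return w

# Hypothetical test problem: E(w) = 0.5 * sum(scales * (w - target)**2),
# whose gradient magnitudes differ by three orders of magnitude.
scales = np.array([1.0, 1000.0])
target = np.array([3.0, -2.0])
w_min = minimize_rprop(lambda w: scales * (w - target), [0.0, 0.0])
```

Because only the sign of each partial derivative enters the update, both coordinates are treated identically despite the very different curvatures, which is the property the text attributes to RProp.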

RProp is robust against the choice of its initial parameters. In comparison 
with BP, Quickprop [29], and SuperSAB [120], the number of learning steps is 
significantly reduced and computational complexity of RProp at each step is 
considerably smaller [100, 39]. RProp has a performance comparable to that 
of the CG method [39]. It is one of the best performing first-order learning 
methods for neural networks. It is suitable for hardware implementation and is 
not susceptible to numerical problems. RProp has also been used for training 
RBF networks [7] and recurrent fuzzy neural networks [80]. 


Example 4.6: From the housing data set, there are a total of 506 example homes 
with 13 items of geographical and real estate information and their associated 
market values. We design a network that can predict the value of a house (in 
$1000s), given 13 inputs. We simulate for RProp and batch BP for 1000 epochs.
The learning rate for BP is 0.00008. The MSE for training is shown in Fig. 4.10.
It is shown that RProp is two orders of magnitude faster than BP.

Figure 4.10 Illustration of the RProp algorithm.


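SASS [38] uses the same update rule as RProp, but the update of $\Delta_{ij}^{(m)}(t)$ is
based on the bisection method for minimization in one dimension and uses two
previous signs

$$\Delta_{ij}^{(m)}(t) = \begin{cases}
2.0\, \Delta_{ij}^{(m)}(t-1), & C \ge 0 \text{ and } g_{ij}^{(m)}(t)\, g_{ij}^{(m)}(t-2) \ge 0 \\
0.5\, \Delta_{ij}^{(m)}(t-1), & \text{otherwise}
\end{cases}.$$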


SASS provides a performance comparable to that of RProp [39]. 

Some variants aiming at improving RProp are available. QRprop and diagonal 
estimation Rprop (DERprop) [96] are two similar hybrids of Rprop and second- 
order search steps. They adaptively switch between the two methods by using 
the strategy of Rprop and switching to second-order approximation only when 
the search is in the vicinity of a local minimum. QRprop makes use of local one- 
dimensional secant steps, which are used in Quickprop. DERprop directly com- 
putes the diagonal elements of the Hessian. The addition of simulated annealing 
in the form of noise and weight decay to RProp yields the SARProp algorithm 
[121]. 

In RProp, if the partial derivatives in consecutive steps possess the same sign,
then the weight is updated and, if consecutive partial derivatives possess
opposite signs, that is, $C < 0$, then the previous weight update is reverted. The
change in sign in successive steps is considered as a jump over a minimum. This
weight-backtracking is counterproductive when the overall error has decreased
during the change in sign of the partial derivatives. In improved RProp [46], the
previous weight update is reverted only when it has caused $C < 0$ together with
an overall error increase, that is, $E(t) > E(t-1)$. Improved RProp outperforms
RProp and CG, and has a performance comparable to that of the BFGS method. The
complex RProp algorithm with error-dependent weight backtracking step [124]
extends the improved RProp algorithm [46] to the complex domain.

GRprop [3] is a globally convergent modification of Rprop. It is built on a 
mathematical framework for convergence analysis [75], which ensures that the 
adaptive local learning rates of Rprop’s schedule generate a descent-search direc- 
tion at each iteration. GRprop exhibits a better convergence speed and stability 
than Rprop or improved Rprop does. 


4.1 Construct a 2-1-1 network that computes the XOR function of two inputs. 
(a) Plot the network structure, and put the learned weights and biases on the 
graph. 

(b) Give the learning steps.

(c) Express the decision regions in a two-dimensional plot.

(d) Construct a truth table for the network operation.


4.2 Parity is cyclic shift invariant. A shift of any bits generates the same parity. 
Use MLP to: 

(a) Learn the parity function. 

(b) Model 4-bit parity checker (logic problem). 


4.3 The correlation matrix of the input vector $\boldsymbol{x}_t$ is given by
$\boldsymbol{R} = \begin{bmatrix} 1 & 0.4 \\ 0.4 & 1 \end{bmatrix}$.
For the LMS algorithm, what is the suitable range of the learning rate $\eta$ for the
algorithm to be convergent?


4.4 In Example 4.4, we use different values of a single node to represent different 
classes. A more common architecture of a neural network classifier is usually 
defined according to the convention from a competitive learning based classifier. 
To classify a pattern set with $J_1$-dimensional inputs into $K$ classes, the neural
network is usually selected as having $J_1$ input nodes, $K$ output nodes with each
corresponding to one class, and zero or multiple hidden layers. During training,
if the target class is $k$, the target value of the $k$th output node is set to 1 and
those of all the other output nodes are set to 0. A pattern is considered to be
correctly classified if the target output node has the highest output among all
the output nodes.

Simulate the iris data set using a 4-4-3 MLP by BP and BP-with-momentum, 
and compare the result with that given in Example 4.4. All the learning param- 
eters are selected to be the same as those for Example 4.4. 


4.5 Approximate the following functions by using MLP: 
(a) $f(x_1, x_2) = \max\left\{e^{-x_1^2}, e^{-2x_2^2}, 2e^{-0.5(x_1^2 + x_2^2)}\right\}$.

(b) $f(x, y) = 0.5 + 0.1x^2 \cos(y + 3) + 0.4xye^{1-y^2}$.

(c) $f(\boldsymbol{x}) = \sin\left(\sqrt{x_1^2 + x_2^2}\right) / \sqrt{x_1^2 + x_2^2}$, $\boldsymbol{x} \in [-5, 5]^2$.

(d) $f(x) = \sqrt{2} \sin x + \sqrt{2} \cos x - \sqrt{2} \sin 3x + \sqrt{2} \cos 3x$.



Chapter 4. Multilayer perceptrons: architecture and error backpropagation 


4.6 Why is the MSE not a good measure of performance for classification prob- 
lems? 


4.7 In [130], $E_c$ is defined as


where $w_0$ is a free parameter. For small $w_0$, the network prefers large weights.
When $w_0$ is taken as unity, this penalty term decays the small weights more
rapidly than the large weights [37]. For large $w_0$, the network prefers small
weights. Derive $\partial E_c / \partial w_{ij}$.

4.8 Derive the gradient descent rule when the error function is defined by
$E = \frac{1}{2} \sum_k (t_k - y_k)^2 + \sum_{i,j} w_{ij}^2$, where $y_k$ is the network output for the $k$th input.


4.9 Derive the forward and backward propagation equations for each of the 
loss functions: 

(a) Kullback-Leibler divergence criterion (2.23), (2.24), (2.25). 

(b) The cross-entropy cost function (2.26). 

(c) The Minkowski-r metric (2.27). 


4.10 Write a program implementing the three-layer MLP with the BP rule, 
using each of the criteria listed in Problem 4.9. Test the program using the 
different criteria on the iris classification problem. 


4.11 Write a program implementing the three-layer MLP with the BP rule, 
incorporating the weight-decaying function. 

(a) Generate 200 points from $y = \phi(x_1 + 2x_2) + 0.5(x_1 - x_2)^2 + 0.5N$, where
$\phi(\cdot)$ is the logistic sigmoidal function, and $N$ is a number drawn from the
standard normal distribution. Apply the program on the samples.

(b) Test the program using 1000 randomly generated samples. 

(c) Plot the training and testing errors versus the number of training epochs for 
differing weight decay parameters. 

(d) Describe the overfitting phenomenon observed. 


4.12 Show that the derivative of the softmax function, $y_i = e^{x_i} / \sum_j e^{x_j}$, is
$\partial y_i / \partial x_j = y_i (\delta_{ij} - y_j)$, where $\delta_{ij} = 1$ for $i = j$ and 0 otherwise.


4.13 This exercise is excerpted from [9]. Show that adding noise to the input
vectors has the same effect as a weight-decaying regularization for a linear
network model

$$y_k = \sum_i w_{ki} x_i + w_{k0}$$

and a sum-of-squares error function

$$E = \frac{1}{2N} \sum_{n=1}^{N} \sum_k \left\{y_k(\boldsymbol{x}_n) - t_{n,k}\right\}^2,$$

where $N$ is the training set size, and $t_{n,k}$ are the target values. Assume that the
random noise components $\varepsilon_i$ are Gaussian, $\varepsilon_i \sim N(0, \nu)$, and $E(\varepsilon_i \varepsilon_j) = \delta_{ij} \nu$.


4.14 Show that the OBD algorithm for network pruning is a special case of the 
OBS algorithm. 


4.15 Show how to find the inverse of an invertible matrix by using the BP
algorithm.


Multilayer perceptrons: other learning techniques


Introduction to second-order learning methods 


Training of feedforward networks can be viewed as an unconstrained optimization 
problem. BP is slow to converge when the error surface is flat along a weight 
dimension. Second-order optimization techniques have a strong theoretical basis 
and provide significantly faster convergence. Second-order methods make use of 
the Hessian matrix $\boldsymbol{H}$, that is, the second-order derivative of the error $E$ with
respect to the $N_w$-dimensional weight vector $\boldsymbol{w}$, which is a vector obtained by
concatenating all the weights and biases of a network:

$$\boldsymbol{H}(t) = \left. \frac{\partial^2 E}{\partial \boldsymbol{w}^2} \right|_t. \tag{5.1}$$

It is an $N_w \times N_w$ matrix. This matrix contains information as to how the gradient
changes in different directions of the weight space. The calculation of H can be 
implemented into the BP algorithm [14]. For feedforward networks, H is ill- 
conditioned [75]. 

Second-order algorithms can either be of matrix or vector type. Matrix-type 
algorithms require the storage for the Hessian and its inverse. The Broyden- 
Fletcher-Goldfarb-Shanno (BFGS) method [25] and a class of Newton’s methods 
are matrix-type algorithms. Matrix-type algorithms are typically two orders of
magnitude faster than BP. The computational complexity is at least $O(N_w^2)$
floating-point operations, when used for supervised learning of MLP.

Vector-type algorithms, on the other hand, require the storage of a few vec- 
tors. Examples of such algorithms include the limited-memory BFGS [7], one- 
step secant [8, 9], scaled CG [56], and CG methods [39, 86]. They are typically 
one order of magnitude faster than BP. Vector-type algorithms require iterative 
computation of the Hessian or implicitly exploit the structure of the Hessian. 
They are based on line-search or trust-region-search methods. 

In BP, the selection of the learning parameters by trial-and-error is a daunting 
task for a large training set. In second-order methods, learning parameters can be 
automatically adapted. However, second-order methods are required to be used 
in batch mode due to the numerical sensitivity of the computation of
second-order gradients.


Newton’s methods


Newton’s methods [8, 6] require explicit computation and storage of the Hessian. 
They are variants of the classical Newton’s method. These include the
Gauss-Newton and Levenberg-Marquardt (LM) methods. Newton’s methods achieve
quadratic convergence. They are less sensitive to the learning constant, and
a proper learning constant is easily selected.

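At step $t + 1$, we expand $E(\boldsymbol{w})$ into a Taylor series

$$E(\boldsymbol{w})|_{t+1} = E(\boldsymbol{w})|_t + [\boldsymbol{w}(t+1) - \boldsymbol{w}(t)]^T \boldsymbol{g}(t)
+ \frac{1}{2} [\boldsymbol{w}(t+1) - \boldsymbol{w}(t)]^T \boldsymbol{H}(t) [\boldsymbol{w}(t+1) - \boldsymbol{w}(t)] + \cdots, \tag{5.2}$$

where the gradient vector is given by

$$\boldsymbol{g}(t) = \nabla E(\boldsymbol{w}(t)) = \nabla E(\boldsymbol{w})|_t. \tag{5.3}$$

Equating $\boldsymbol{g}(t+1)$ to zero:

$$\boldsymbol{g}(t+1) = \boldsymbol{g}(t) + \boldsymbol{H}(t) [\boldsymbol{w}(t+1) - \boldsymbol{w}(t)] + \cdots = \boldsymbol{0}. \tag{5.4}$$

By ignoring the third- and higher-order terms, the classical Newton’s method is
obtained:

$$\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \boldsymbol{d}(t), \tag{5.5}$$

$$\boldsymbol{d}(t) = -\boldsymbol{H}^{-1}(t) \boldsymbol{g}(t). \tag{5.6}$$

For MLP, the Hessian is a singular matrix [88], and thus (5.6) cannot be
used. Nevertheless, we can make use of (5.4) and solve the following set of linear
equations for the step $\boldsymbol{d}(t)$

$$\boldsymbol{g}(t) = -\boldsymbol{H}(t) \boldsymbol{d}(t). \tag{5.7}$$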


This set of linear equations can be solved by using SVD or QR decomposition. 
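
As a minimal sketch of solving (5.7) without forming an inverse, the step can be obtained from an SVD-based least-squares solver, which also returns a minimum-norm step when the Hessian is singular. The helper name `newton_step` and the small quadratic test function are assumptions made for illustration only.

```python
import numpy as np

def newton_step(H, g):
    """Solve H d = -g, cf. (5.7), in the least-squares sense.

    np.linalg.lstsq is SVD-based, so a singular H yields the
    minimum-norm solution instead of raising an error.
    """
    d, *_ = np.linalg.lstsq(H, -g, rcond=None)
    return d

# Hypothetical quadratic error E(w) = 0.5 w^T A w - b^T w,
# for which the gradient is A w - b and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = np.array([5.0, 5.0])
w_new = w + newton_step(A, A @ w - b)          # update (5.5)

# Singular Hessian: the step is confined to the non-null direction.
d_sing = newton_step(np.array([[1.0, 0.0], [0.0, 0.0]]), np.array([2.0, 0.0]))
```

For the quadratic, a single step reaches the minimizer, since the second-order Taylor model is exact in that case.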

From second-order conditions, $\boldsymbol{H}(t)$ must be positive-definite for searching a
minimum. At each iteration, $E$ is approximated locally by a second-order Taylor
polynomial, which is minimized subsequently. This minimization is computationally
prohibitive, since computation of $\boldsymbol{H}(t)$ needs global information and solution of
a set of linear equations is also required [15]. In the classical Newton’s method,
$O(N_w^3)$ floating-point operations are needed for computing the search direction;
moreover, the method is not suitable when $\boldsymbol{w}(t)$ is remote from the solution,
since $\boldsymbol{H}(t)$ may not be positive-definite.


Gauss-Newton method


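Denote $E(\boldsymbol{w})$ as

$$E(\boldsymbol{w}) = \frac{1}{2} \sum_{i=1}^{N} \epsilon_i^2(\boldsymbol{w}) = \frac{1}{2} \boldsymbol{\epsilon}^T(\boldsymbol{w}) \boldsymbol{\epsilon}(\boldsymbol{w}), \tag{5.8}$$

where $\boldsymbol{\epsilon}(\boldsymbol{w}) = (\epsilon_1(\boldsymbol{w}), \epsilon_2(\boldsymbol{w}), \ldots, \epsilon_N(\boldsymbol{w}))^T$, $\epsilon_i = \|\boldsymbol{e}_i\|$, and $\boldsymbol{e}_i = \hat{\boldsymbol{y}}_i - \boldsymbol{y}_i$.
Thus, $\epsilon_i^2 = \boldsymbol{e}_i^T \boldsymbol{e}_i$.

The gradient vector is obtained by

$$\boldsymbol{g}(\boldsymbol{w}) = \frac{\partial E(\boldsymbol{w})}{\partial \boldsymbol{w}} = \boldsymbol{J}^T(\boldsymbol{w}) \boldsymbol{\epsilon}(\boldsymbol{w}), \tag{5.9}$$

where $\boldsymbol{J}(\boldsymbol{w})$, an $N \times N_w$ Jacobian matrix, is defined by

$$\boldsymbol{J}(\boldsymbol{w}) = \frac{\partial \boldsymbol{\epsilon}(\boldsymbol{w})}{\partial \boldsymbol{w}^T} = [J_{ij}] = \left[\frac{\partial \epsilon_i}{\partial w_j}\right]. \tag{5.10}$$

Further, the Hessian is obtained by

$$\boldsymbol{H} = \frac{\partial \boldsymbol{g}(\boldsymbol{w})}{\partial \boldsymbol{w}^T} = \boldsymbol{J}^T(\boldsymbol{w}) \boldsymbol{J}(\boldsymbol{w}) + \boldsymbol{S}(\boldsymbol{w}), \tag{5.11}$$

where

$$\boldsymbol{S}(\boldsymbol{w}) = \sum_{i=1}^{N} \epsilon_i(\boldsymbol{w}) \nabla^2 \epsilon_i(\boldsymbol{w}). \tag{5.12}$$

Assuming that $\boldsymbol{S}(\boldsymbol{w})$ is small, we approximate the Hessian using

$$\boldsymbol{H}_{\rm GN}(t) = \boldsymbol{J}^T(t) \boldsymbol{J}(t), \tag{5.13}$$

where $\boldsymbol{J}(t)$ denotes $\boldsymbol{J}(\boldsymbol{w}(t))$.
In view of (5.5), (5.6) and (5.9), we obtain

$$\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \boldsymbol{d}(t), \tag{5.14}$$

$$\boldsymbol{d}(t) = -\boldsymbol{H}_{\rm GN}^{-1}(t) \boldsymbol{J}^T(t) \boldsymbol{\epsilon}(t), \tag{5.15}$$

where $\boldsymbol{\epsilon}(t)$ denotes $\boldsymbol{\epsilon}(\boldsymbol{w}(t))$. The above procedure is the Gauss-Newton method.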
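
The Gauss-Newton iteration (5.13)-(5.15) can be illustrated on a small curve-fitting problem. The sketch below is an assumption-laden example, not taken from the text: it fits the hypothetical model $\hat{y} = a e^{bx}$ with $\boldsymbol{w} = (a, b)$ to noise-free data, so that the residual term $\boldsymbol{S}$ vanishes at the solution.

```python
import numpy as np

def residuals(w, x, y):
    """Residual vector eps(w) = y_hat - y for the model y_hat = a * exp(b x)."""
    a, b = w
    return a * np.exp(b * x) - y

def jacobian(w, x):
    """N x 2 Jacobian with entries d eps_i / d w_j, cf. (5.10)."""
    a, b = w
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])

def gauss_newton(w0, x, y, n_iter=20):
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        eps = residuals(w, x, y)
        J = jacobian(w, x)
        # Solve H_GN d = -J^T eps with H_GN = J^T J, cf. (5.13) and (5.15).
        d = np.linalg.solve(J.T @ J, -J.T @ eps)
        w = w + d                                   # update (5.14)
    return w

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)          # data generated with a = 2, b = -1.5
w_fit = gauss_newton([1.0, -1.0], x, y)
```

Only first-order derivatives of the residuals are used. The example starts reasonably close to the solution; from a poor starting point the plain iteration above has no safeguard against divergence.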

The Gauss-Newton method approximates the Hessian using information from
first-order derivatives only. However, far away from the solution, the term $\boldsymbol{S}$ is
not negligible and thus the approximation to the Hessian $\boldsymbol{H}$ is poor, resulting
in slow convergence. The Gauss-Newton method may have an ill-conditioned
Jacobian matrix and $\boldsymbol{H}_{\rm GN}$ may be noninvertible. In this case, like in the classical
Newton’s method, one can instead solve $\boldsymbol{H}_{\rm GN}(t) \boldsymbol{d}(t) = -\boldsymbol{J}^T(t) \boldsymbol{\epsilon}(t)$ for $\boldsymbol{d}(t)$. For
every pattern, the BP algorithm requires only one backpropagation process, while
in second-order algorithms the backpropagation process is repeated for every
output separately in order to obtain consecutive rows of the Jacobian.

An iterative Gauss-Newton method based on the generalized secant method
using Broyden’s approach is given as [25]

$$\boldsymbol{q}(t) = \boldsymbol{\epsilon}(t+1) - \boldsymbol{\epsilon}(t), \tag{5.16}$$

$$\boldsymbol{J}(t+1) = \boldsymbol{J}(t) + \frac{[\boldsymbol{q}(t) - \boldsymbol{J}(t) \boldsymbol{d}(t)]\, \boldsymbol{d}^T(t)}{\boldsymbol{d}^T(t) \boldsymbol{d}(t)}. \tag{5.17}$$

The method uses the same update given by (5.14) and (5.15).


Levenberg-Marquardt method 


The LM method [57] eliminates the possible singularity of $\boldsymbol{H}$ by adding a small
multiple of the identity matrix to it. This method is derived by minimizing the
quadratic approximation to $E(\boldsymbol{w})$ subject to the constraint that the step length
$\|\boldsymbol{d}(t)\|$ is within a trust region at step $t$. At given $\boldsymbol{w}(t)$, the second-order
Taylor approximation of $E(\boldsymbol{w})$ is given by

$$\hat{E}(\boldsymbol{w}(t) + \boldsymbol{d}(t)) = E(\boldsymbol{w}(t)) + \boldsymbol{g}(t)^T \boldsymbol{d}(t) + \frac{1}{2} \boldsymbol{d}^T(t) \boldsymbol{H}(t) \boldsymbol{d}(t). \tag{5.18}$$

The search step $\boldsymbol{d}(t)$ is computed by solving the trust-region subproblem

$$\min_{\boldsymbol{d}(t)} \hat{E}(\boldsymbol{w}(t) + \boldsymbol{d}(t)) \quad \text{subject to} \quad \|\boldsymbol{d}(t)\| \le \delta_t, \tag{5.19}$$

where $\delta_t$ is a positive scalar and $\{\boldsymbol{d}(t) \,|\, \|\boldsymbol{d}(t)\| \le \delta_t\}$ is the trust region around
$\boldsymbol{w}(t)$.

This inequality constrained optimization problem can be solved by using the
Karush-Kuhn-Tucker (KKT) theorem [25], which leads to

$$\boldsymbol{H}_{\rm LM}(t) = \boldsymbol{H}(t) + \sigma(t) \boldsymbol{I}, \tag{5.20}$$

where $\sigma(t)$ is a small positive value, which indirectly controls the size of the trust
region.

The LM modification to the Gauss-Newton method is given as [31]

$$\boldsymbol{H}_{\rm LM}(t) = \boldsymbol{H}_{\rm GN}(t) + \sigma(t) \boldsymbol{I}. \tag{5.21}$$

Thus, $\boldsymbol{H}_{\rm LM}$ is always invertible. The LM method given in (5.21) can be treated
as a trust-region modification to the Gauss-Newton method [8].

The LM method is based on the assumption that such an approximation to
the Hessian is valid only inside a trust region of small radius, controlled by $\sigma$. If
the eigenvalues of $\boldsymbol{H}$ are $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{N_w}$, then the eigenvalues of $\boldsymbol{H}_{\rm LM}$ are
$\lambda_i + \sigma$, $i = 1, \ldots, N_w$, with the same corresponding eigenvectors. $\sigma$ is selected so
that $\boldsymbol{H}_{\rm LM}$ is positive-definite, that is, $\lambda_{N_w} + \sigma > 0$. As a result, the LM method
eliminates the singularity of $\boldsymbol{H}$ for MLP.

The LM method is therefore given by

$$\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \boldsymbol{d}(t), \tag{5.22}$$

$$\boldsymbol{d}(t) = -\boldsymbol{H}_{\rm LM}^{-1}(t) \boldsymbol{J}^T(t) \boldsymbol{\epsilon}(t). \tag{5.23}$$

For large $\sigma$, the algorithm reduces to BP with $\eta = \frac{1}{\sigma}$. However, when $\sigma$ is small,
the algorithm reduces to the Gauss-Newton method. Thus, there is a tradeoff
between the fast learning speed of the classical Newton’s method and the
guaranteed convergence of the gradient descent. $\sigma$ can be adapted by [31]

$$\sigma(t) = \begin{cases}
\sigma(t-1)\, \gamma, & \text{if } E(t) \ge E(t-1) \\
\dfrac{\sigma(t-1)}{\gamma}, & \text{if } E(t) < E(t-1)
\end{cases}, \tag{5.24}$$

where $\gamma > 1$ is a constant. Typically, $\sigma(0) = 0.01$ and $\gamma = 10$ [31]. The
computation of the Jacobian is based on a simple modification to the BP algorithm
[31]. Other methods for selecting $\sigma(t)$ include the hook step [57], Powell’s dogleg
method [25], and some rules of thumb [25].
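
The interplay of (5.21), (5.23) and (5.24) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `levenberg_marquardt` and the exponential test model are hypothetical, and rejecting a step that does not reduce the error is a common practical choice added here, whereas (5.24) itself only specifies how $\sigma$ changes.

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, w0, sigma=0.01, gamma=10.0, n_iter=100):
    """LM iteration with the sigma adaptation rule (5.24)."""
    w = np.asarray(w0, dtype=float)
    eps = residuals(w)
    E = 0.5 * eps @ eps
    for _ in range(n_iter):
        J = jacobian(w)
        H_lm = J.T @ J + sigma * np.eye(w.size)      # (5.21)
        d = np.linalg.solve(H_lm, -J.T @ eps)        # (5.23)
        eps_new = residuals(w + d)
        E_new = 0.5 * eps_new @ eps_new
        if E_new < E:
            # Error decreased: accept the step and move toward Gauss-Newton.
            w, eps, E = w + d, eps_new, E_new
            sigma /= gamma
        else:
            # Error did not decrease: keep w (assumption of this sketch)
            # and move toward small-step gradient descent.
            sigma *= gamma
    return w

# Hypothetical test problem: fit y_hat = a * exp(b x) to noise-free data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
res = lambda w: w[0] * np.exp(w[1] * x) - y
jac = lambda w: np.column_stack([np.exp(w[1] * x), w[0] * x * np.exp(w[1] * x)])
w_fit = levenberg_marquardt(res, jac, [1.0, 0.0])
```

With the typical values $\sigma(0) = 0.01$ and $\gamma = 10$, the iteration behaves like Gauss-Newton while steps succeed and falls back to short gradient-like steps after a failure.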

Newton’s methods for MLP lack iterative implementation of H, and the com- 
putation of H~! is also expensive. They also suffer from the ill-representability 
of the diagonal terms of H and the requirement of a good initial estimate of the 
weights. The LM method is a trust-region method with a hyperspherical trust 
region. It is an efficient algorithm for medium-sized neural networks [31]. The LM 
method demands large memory space to store the Jacobian, the approximated 
Hessian, and the inversion of a matrix at each iteration. In [24], backpropagation 
is used for the matrix-matrix multiplication in the Gauss-Newton matrix; this 
reduces the running time of the LM method by a factor of $O(J_M)$, where $J_M$ is
the number of output nodes.

There are some variants of the LM method. A modified LM method [89] is 
obtained by modifying the error function and using the slope between the desired 
and actual outputs in the activation function to replace the standard derivative at 
the point of the actual output. This method gives a better convergence rate with
less computational complexity and reduces the memory requirement from $N_w^2$
to $J_M^2$ allocations. The trust-region based error aggregated training (TREAT)
algorithm [20] is similar to the LM method, but uses a different Hessian matrix 
approximation based on the Jacobian matrix derived from aggregated error vec- 
tors. The new Jacobian is significantly smaller. The size of the matrix to be 
inverted at each iteration is also reduced by using the matrix inversion lemma. 
A recursive LM algorithm for online training of neural networks is given in [60] 
for nonlinear system identification. 

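The disadvantages of the LM method as well as of Newton’s methods can
be alleviated by the block Hessian based Newton’s method [88], where a block
Hessian matrix $\boldsymbol{H}_b$ is defined to approximate and simplify $\boldsymbol{H}$. Each $\boldsymbol{W}^{(m)}$, or its
vector form $\boldsymbol{w}^{(m)}$, corresponds to a diagonal partition matrix $\boldsymbol{H}_b^{(m)}$, and

$$\boldsymbol{H}_b = {\rm blockdiag}\left(\boldsymbol{H}_b^{(1)}, \boldsymbol{H}_b^{(2)}, \ldots, \boldsymbol{H}_b^{(M-1)}\right). \tag{5.25}$$

$\boldsymbol{H}_b$ is proved to be a singular matrix [88]. In LM implementation, the inverse of
$\boldsymbol{H}_b + \sigma \boldsymbol{I}$ can be decomposed into the inverse of each diagonal block $\boldsymbol{H}_b^{(m)} + \sigma \boldsymbol{I}$,
and the problem is decomposed into $M - 1$ subproblems

$$\Delta \boldsymbol{w}^{(m)} = -\left(\boldsymbol{H}_b^{(m)} + \sigma \boldsymbol{I}\right)^{-1} \boldsymbol{g}^{(m)}, \quad m = 1, \ldots, M - 1, \tag{5.26}$$

where the gradient partition $\boldsymbol{g}^{(m)} = \frac{\partial E}{\partial \boldsymbol{w}^{(m)}}$. The inverse in each subproblem can
be computed recursively according to the matrix inversion lemma.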
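
The decomposition into independent subproblems can be checked numerically. In the sketch below, the helper `block_lm_step`, the block sizes, and the random rank-one (hence singular) blocks are illustrative assumptions; the blocks are solved directly with `np.linalg.solve` rather than recursively via the matrix inversion lemma.

```python
import numpy as np

def block_lm_step(blocks, grads, sigma):
    """Solve (H_b^(m) + sigma I) dw^(m) = -g^(m) block by block, cf. (5.26)."""
    return [np.linalg.solve(H + sigma * np.eye(H.shape[0]), -g)
            for H, g in zip(blocks, grads)]

rng = np.random.default_rng(0)
# Two singular positive-semidefinite blocks of sizes 3 and 2 (rank one each).
blocks = []
for n in (3, 2):
    a = rng.standard_normal((n, 1))
    blocks.append(a @ a.T)
grads = [rng.standard_normal(3), rng.standard_normal(2)]
sigma = 0.1

steps = block_lm_step(blocks, grads, sigma)

# Reference: assemble the full block-diagonal matrix and solve once.
H_b = np.zeros((5, 5))
H_b[:3, :3] = blocks[0]
H_b[3:, 3:] = blocks[1]
full = np.linalg.solve(H_b + sigma * np.eye(5), -np.concatenate(grads))
```

Although each block is singular on its own, adding $\sigma \boldsymbol{I}$ makes every subproblem solvable, and the concatenated block solutions agree with the solution of the full system.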

LM with adaptive momentum (LMAM) and optimized LMAM [3] combine 
the merits of both the LM and CG techniques, and help the LM method escape 
from the local minima. LMAM is derived by optimizing the mutually conjugate 
property of the two steps subject to a constraint on the error change as well as a 
different trust-region condition d” (t)H(t)d(t) < ô+. This leads to two parameters 
to be tuned. Optimized LMAM is adaptive, requiring minimal input from the 
end user. LMAM is globally convergent. Their implementations require minimal 
additional computations when compared to the LM iteration, and this is, how- 
ever, compensated by their excellent convergence properties. Both the methods 
generate better results than LM, BFGS, and Polak-Ribiere CG with restarts [39]. 

LM training is restricted by the memory requirement for large pattern sets. Its 
implementations require calculation of the Jacobian matrix, whose size is propor- 
tional to the number of training patterns N. In an improved LM algorithm [91], 
the quasi-Hessian matrix and gradient vector are computed directly, without
Jacobian matrix multiplication and storage. The memory requirement for
quasi-Hessian matrix and gradient vector computation is decreased by a factor
of $N \times J_M$, where $J_M$ is the number of outputs. Exploiting the symmetry of the
quasi-Hessian matrix, only elements in its upper/lower triangular array are
calculated. Therefore, memory requirement and training speed are improved
significantly.

A forward-only LM method [92] uses the forward-only computation instead 
of the traditional forward and backward computation for calculation of the ele- 
ments of the Jacobian. Information needed for the gradient vector and Jaco- 
bian or Hessian matrix is obtained during forward computation. The forward- 
only method gives identical number of training iterations and success rates as 
LM does, since the same Jacobian matrices are obtained. The LM algorithm 
has been adapted for arbitrarily connected neural networks [90], which can 
handle a problem of same complexity with a much smaller number of neu- 
rons. The forward-only LM method allows for efficiently training arbitrarily 
connected networks. For networks with multiple outputs, the forward-only LM
method (http://www.eng.auburn.edu/users/wilambm/nnt/) has a lower
computational complexity than the traditional forward and backward computations
do [31], [90].


Quasi-Newton methods 


Quasi-Newton methods approximate Newton’s direction without evaluating 
second-order derivatives of the cost function. The approximation of the Hessian
or its inverse is computed in an iterative process. They are a class of
gradient-based methods whose descent direction vector $\boldsymbol{d}(t)$ approximates Newton’s
direction. Notice that in this subsection, $\boldsymbol{d}(t)$ denotes the descent direction, and
$\boldsymbol{s}(t)$ the step size; in Newton’s methods, the two vectors are equivalent and are
represented by $\boldsymbol{d}(t)$:

$$\boldsymbol{d}(t) = -\boldsymbol{H}^{-1}(t) \boldsymbol{g}(t). \tag{5.27}$$

Thus, $\boldsymbol{d}(t)$ can be obtained by solving a set of linear equations:

$$\boldsymbol{H}(t) \boldsymbol{d}(t) = -\boldsymbol{g}(t). \tag{5.28}$$

The Hessian is always symmetric and is often positive-definite. Quasi-Newton
methods with positive-definite Hessian are called variable-metric methods. Secant
methods are a class of variable-metric methods that use differences to obtain
an approximation to the Hessian. The memory requirement for quasi-Newton
methods is $\frac{1}{2} N_w^2 + O(N_w)$, which is the same as that for Newton’s methods.
These methods approximate the classical Newton’s method, and thus convergence is
very fast.

The line-search and trust-region methods are two globally convergent strate- 
gies. The line-search method tries to limit the step size along Newton’s direction 
until it is unacceptably large, whereas in the trust-region method the quadratic 
approximation of the cost function can be trusted only within a small region 
in the vicinity of the current point. Both methods retain the rapid-convergence 
property of Newton’s methods and are generally applicable [25]. 

In quasi-Newton methods, a line search is applied such that

$$\lambda(t) = \arg\min_{\lambda \ge 0} E(\boldsymbol{w}(t+1)) = \arg\min_{\lambda \ge 0} E(\boldsymbol{w}(t) + \lambda \boldsymbol{d}(t)). \tag{5.29}$$

Line search is used to guarantee that at each iteration the objective function
decreases, which is dictated by the convergence requirement. The optimal $\lambda(t)$ can
be theoretically derived from

$$\frac{\partial}{\partial \lambda} E(\boldsymbol{w}(t) + \lambda \boldsymbol{d}(t)) = 0, \tag{5.30}$$

and this yields a representation using the Hessian. The second-order derivatives
are approximated by the difference of the first-order derivatives at two
neighboring points, and thus $\lambda$ is calculated by

$$\lambda(t) = \frac{-\tau\, \boldsymbol{g}^T(t) \boldsymbol{d}(t)}{\boldsymbol{d}^T(t) \left[\boldsymbol{g}_{\tau}(t) - \boldsymbol{g}(t)\right]}, \tag{5.31}$$

where $\boldsymbol{g}_{\tau}(t) = \nabla E(\boldsymbol{w}(t) + \tau \boldsymbol{d}(t))$, and the size of the neighborhood $\tau$ is carefully
selected. Some inexact line-search and line-search-free optimization methods are
applied to quasi-Newton methods, which are further used for training feedforward
networks [10].

There are many secant methods of rank one or rank two. The Broyden family
is a family of rank-one and rank-two methods generated by taking [25]

$$\boldsymbol{H}(t) = (1 - \nu) \boldsymbol{H}_{\rm DFP}(t) + \nu \boldsymbol{H}_{\rm BFGS}(t), \tag{5.32}$$

where $\boldsymbol{H}_{\rm DFP}$ and $\boldsymbol{H}_{\rm BFGS}$ are, respectively, the Hessians obtained by the
Davidon-Fletcher-Powell (DFP) and BFGS methods, and $\nu$ is a constant between
0 and 1. By giving different values for $\nu$, one can get DFP ($\nu = 0$), BFGS ($\nu = 1$),
or other rank-one or rank-two formulae. DFP and BFGS are two dual rank-two
secant methods, and BFGS emerges as a leading variable-metric contender in
theory and practice [58]. Many of the properties of DFP and BFGS are common
to the whole family.


BFGS method 


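The BFGS method [7, 25, 58] is implemented as follows. Inexact line search can
be applied to BFGS, and this significantly reduces the number of evaluations of
the error function. The Hessian $\mathbf{H}$ or its inverse is updated by

$$\mathbf{H}(t+1) = \mathbf{H}(t) - \frac{\mathbf{H}(t)\mathbf{s}(t)\mathbf{s}^T(t)\mathbf{H}(t)}{\mathbf{s}^T(t)\mathbf{H}(t)\mathbf{s}(t)} + \frac{\mathbf{z}(t)\mathbf{z}^T(t)}{\mathbf{s}^T(t)\mathbf{z}(t)}, \tag{5.33}$$

$$\mathbf{H}^{-1}(t+1) = \mathbf{H}^{-1}(t) + \left(1 + \frac{\mathbf{z}^T(t)\mathbf{H}^{-1}(t)\mathbf{z}(t)}{\mathbf{s}^T(t)\mathbf{z}(t)}\right)\frac{\mathbf{s}(t)\mathbf{s}^T(t)}{\mathbf{s}^T(t)\mathbf{z}(t)} - \frac{\mathbf{s}(t)\mathbf{z}^T(t)\mathbf{H}^{-1}(t) + \mathbf{H}^{-1}(t)\mathbf{z}(t)\mathbf{s}^T(t)}{\mathbf{s}^T(t)\mathbf{z}(t)}, \tag{5.34}$$

where

$$\mathbf{z}(t) = \mathbf{g}(t+1) - \mathbf{g}(t), \tag{5.35}$$

$$\mathbf{s}(t) = \mathbf{w}(t+1) - \mathbf{w}(t). \tag{5.36}$$

For BFGS implementation, $\mathbf{w}(0)$, $\mathbf{g}(0)$, and $\mathbf{H}^{-1}(0)$ need
to be specified. $\mathbf{H}^{-1}(0)$ is typically selected as the identity matrix.
The computational complexity is $O(N_w^2)$ floating-point operations. The method
requires storage of the matrix $\mathbf{H}^{-1}$. By interchanging
$\mathbf{H} \leftrightarrow \mathbf{H}^{-1}$ and $\mathbf{s} \leftrightarrow \mathbf{z}$ in
(5.33) and (5.34), one can obtain the DFP method [25, 58].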

All the secant methods including the BFGS method are derived to satisfy the 
so-called quasi-Newton condition or secant relation [25, 58] 


$$\mathbf{H}^{-1}(t+1)\mathbf{z}(t) = \mathbf{s}(t). \tag{5.37}$$
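A minimal sketch of the inverse-Hessian BFGS update and of the secant relation is given here, assuming NumPy. The function name `bfgs_inverse_update`, the small quadratic test problem, and the closed-form exact step (valid only for a quadratic cost) are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def bfgs_inverse_update(H_inv, s, z):
    """Rank-two BFGS update of the inverse Hessian approximation, cf. (5.34).
    H_inv is assumed symmetric and the curvature s^T z is assumed positive."""
    sz = s @ z
    Hz = H_inv @ z
    return (H_inv
            + (1.0 + (z @ Hz) / sz) * np.outer(s, s) / sz
            - (np.outer(s, Hz) + np.outer(Hz, s)) / sz)

# Hypothetical quadratic test problem E(w) = 0.5 w^T A w - b^T w.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
grad = lambda w: A @ w - b

w = np.zeros(3)
H_inv = np.eye(3)                      # H^{-1}(0) = I
g = grad(w)
for _ in range(3):
    d = -H_inv @ g                     # quasi-Newton search direction
    lam = -(g @ d) / (d @ A @ d)       # exact line search (quadratic only)
    w_new = w + lam * d
    g_new = grad(w_new)
    s, z = w_new - w, g_new - g
    H_inv = bfgs_inverse_update(H_inv, s, z)
    secant_residual = np.linalg.norm(H_inv @ z - s)   # should vanish
    w, g = w_new, g_new
print(secant_residual, np.linalg.norm(grad(w)))
```

On this three-dimensional quadratic, three iterations with exact line search drive the gradient to zero up to rounding; for a general MLP cost an inexact line search would be used instead.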


In [76], a small-memory efficient second-order learning algorithm has been
proposed for three-layer neural networks. The descent direction is calculated on
the basis of a partial BFGS update with $2Nt$ memory space ($t \ll N$), and a
reasonably accurate step length is efficiently calculated as the minimal point
of a second-order approximation to the objective function with respect to the
step length. The search directions are exactly equivalent to those of the
original BFGS update during the first $t + 1$ iterations.

Limited-memory BFGS methods implement parts of the Hessian approxima- 
tion by using second-order information from the most recent iterations [7]. A 
number of limited-memory BFGS algorithms, which have a memory complexity 
of O(N) and do not require accurate line searches, are listed in [79]. In the 
trust-region implementation of BFGS [64], Powell’s dogleg trust-region method 
is used to solve the constrained optimization subproblems. Other variants of 
quasi-Newton methods are the variable-memory BFGS [53] and memory-optimal 
BFGS methods [55]. A class of limited-memory quasi-Newton methods is given
in [16]. These methods utilize an iterative scheme of a generalized BFGS-type
method, and suitably approximate the whole Hessian matrix with a rank-two
formula determined by a fast unitary transform such as the Fourier, Hartley,
Jacobi type, or trigonometric transform. Each has a computational complexity of
$O(N_w \log N_w)$ and requires $O(N_w)$ memory allocations. The close relationship
between the BFGS and CG methods is important for formulating algorithms 
with variable storage or limited memory [58]. Memoryless or limited-memory 
quasi-Newton algorithms can be viewed as a tradeoff between the CG and quasi- 
Newton algorithms, and are closely related to CG. 


One-step secant method 


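The one-step secant method [8, 9] is a memoryless BFGS method, and is obtained
by resetting $\mathbf{H}^{-1}(t)$ as the identity matrix in the BFGS update
equation (5.34) at the $(t+1)$th iteration, and multiplying both sides of the
update by $-\mathbf{g}(t+1)$ to obtain the search direction

$$\mathbf{d}(t+1) = -\mathbf{g}(t+1) + B(t)\mathbf{z}(t) + C(t)\mathbf{s}(t), \tag{5.38}$$

where

$$B(t) = \frac{\mathbf{s}^T(t)\mathbf{g}(t+1)}{\mathbf{s}^T(t)\mathbf{z}(t)}, \tag{5.39}$$

$$C(t) = -\left(1 + \frac{\mathbf{z}^T(t)\mathbf{z}(t)}{\mathbf{s}^T(t)\mathbf{z}(t)}\right)\frac{\mathbf{s}^T(t)\mathbf{g}(t+1)}{\mathbf{s}^T(t)\mathbf{z}(t)} + \frac{\mathbf{z}^T(t)\mathbf{g}(t+1)}{\mathbf{s}^T(t)\mathbf{z}(t)}. \tag{5.40}$$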


The one-step secant method does not store the Hessian, and the new search
direction can be calculated without computing a matrix inverse. It reduces the
computational complexity to $O(N_w)$. However, this results in a considerable
reduction of second-order information, and thus yields slower convergence than
BFGS. When exact line search is applied, the one-step secant method generates
conjugate directions. Both the BFGS and one-step secant methods are efficient
methods for MLP training. Parallel implementations of the two algorithms are
discussed in [54]. A parallel secant method of Broyden’s family with parallel
inexact searches is developed and applied for the training of feedforward
networks [66].
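The memoryless construction can be checked numerically. The sketch below, assuming NumPy, computes the one-step secant direction from vectors only and compares it with the direction obtained by forming the BFGS inverse update with the previous inverse Hessian reset to the identity. The function name `oss_direction` and the test vectors are hypothetical.

```python
import numpy as np

def oss_direction(g_new, s, z):
    """One-step secant search direction built from three vectors only;
    no matrix is stored or inverted."""
    sz = s @ z
    B = (s @ g_new) / sz
    C = -(1.0 + (z @ z) / sz) * (s @ g_new) / sz + (z @ g_new) / sz
    return -g_new + B * z + C * s

# Hypothetical vectors with positive curvature s^T z.
s = np.array([1.0, 0.5, -0.3, 0.2])
z = np.array([0.8, 0.7, -0.1, 0.4])
g_new = np.array([0.3, -1.2, 0.5, 0.9])

d_oss = oss_direction(g_new, s, z)

# Dense reference: BFGS inverse update with the previous inverse set to I.
sz = s @ z
H_inv = (np.eye(4)
         + (1.0 + (z @ z) / sz) * np.outer(s, s) / sz
         - (np.outer(s, z) + np.outer(z, s)) / sz)
d_dense = -H_inv @ g_new
print(np.max(np.abs(d_oss - d_dense)))
```

The vector form needs only inner products, which is where the linear cost per iteration comes from.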


Conjugate-gradient methods 


The CG method [35, 8, 56, 18] is a popular alternative to BP. It has many tried 
and tested, linear and nonlinear variants, each using a different search direction 
and line-search method. Mathematically, the CG method is closely related to the 
quasi-Newton method. The CG method conducts a series of line searches along 
noninterfering directions that are constructed to exploit the Hessian structure 
without explicitly storing it. The storage requirement is O (Nw), and is about 
four times that for BP [40]. The computation time per weight update cycle 
is significantly increased due to the line search for an appropriate step size, 
involving several evaluations of either the error E or its derivative, which requires 
the presentation of the complete training set. 

The CG method conducts a special kind of gradient descent. It constructs
a set of $N_w$ linearly independent, nonzero search directions $\mathbf{d}(t)$,
$\mathbf{d}(t+1)$, ..., $\mathbf{d}(t+N_w-1)$, $t = 0, 1, \ldots$ These search
directions are derived from (5.30), and, at the minimum of the line search,
satisfy

$$\mathbf{g}^T(t+1)\mathbf{d}(t) = 0. \tag{5.41}$$

Based on this, one can construct a sequence of $N_w$ successive search directions
that satisfy the so-called H-conjugate property

$$\mathbf{d}^T(t+i)\mathbf{H}\mathbf{d}(t+j) = 0, \quad i \neq j, \quad \forall\, 0 < |i-j| \le N_w - 1. \tag{5.42}$$

The CG method is updated by

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \lambda(t)\mathbf{d}(t), \tag{5.43}$$

$$\mathbf{d}(t+1) = -\mathbf{g}(t+1) + \beta(t)\mathbf{d}(t), \tag{5.44}$$


where $\lambda(t)$ is the exact step to the minimum of $E(\mathbf{w}(t+1))$ along
the direction of $\mathbf{d}(t)$ and is found by a line search as given in (5.29),
and $\beta(t)$ is a step size to decide $\mathbf{d}(t+1)$. A comparison of the
search directions of the gradient-descent, Newton’s, and CG methods is
illustrated in Fig. 5.1, where $\mathbf{x}_0$ and $\mathbf{x}^*$ are, respectively,
the starting point and the local minimum; $\mathbf{g}_0, \mathbf{g}_1, \ldots$ are
negative gradient directions, and are the search directions for the
gradient-descent method; $\mathbf{d}_0^{\mathrm{N}}$ is the Newton direction, and it
generally points towards the local minimum; $\mathbf{d}_0^{\mathrm{CG}},
\mathbf{d}_1^{\mathrm{CG}}, \ldots$ are the conjugate directions. Notice that
$\mathbf{d}_0^{\mathrm{CG}}$ and $\mathbf{g}_0$ are the same. The contours denote
constant values of $E$.

Figure 5.1 The search directions of the gradient-descent, Newton’s, and CG methods.

A practical implementation of the line-search method is to increase $\lambda$
until $E(\mathbf{w}(t) + \lambda\mathbf{d}(t))$ stops decreasing strictly
monotonically and begins to increase. Thus, a minimum is bracketed. The search
in the interval between the last two values of $\lambda$ is then repeated several
times until $E(\mathbf{w}(t) + \lambda\mathbf{d}(t))$ is sufficiently close to a
minimum. Exact line search is computationally expensive, and the speed of the CG
method depends critically on the line-search efficiency. Faster CG algorithms
with inexact line search [22, 79] are used to train MLP [30]. The scaled CG
algorithm [56] avoids line search by introducing a scalar to regulate the
positive-definiteness of the Hessian $\mathbf{H}$ as used in the LM method. This
is achieved by automatically scaling the weight update vector magnitude using
Powell’s dogleg trust-region method [25].
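The bracketing procedure described above can be sketched as follows. This is an illustration only: the function name `line_search`, the growth factor, and the golden-section refinement are assumptions (the text does not say how the bracketed interval is searched), and the bracket is kept conservatively as the last three trial points so that it is guaranteed to contain a minimum of a unimodal function.

```python
def line_search(phi, step=0.1, grow=2.0, tol=1e-6):
    """Increase lambda until phi(lambda) = E(w + lambda d) stops decreasing,
    then shrink the bracket by golden-section search (an assumed choice)."""
    a, b = 0.0, step
    fa, fb = phi(a), phi(b)
    if fb >= fa:                       # already increasing: minimum in [0, step]
        lo, hi = a, b
    else:
        c = b * grow
        fc = phi(c)
        while fc < fb:                 # still decreasing: keep expanding
            a, b, fb = b, c, fc
            c = c * grow
            fc = phi(c)
        lo, hi = a, c                  # phi(b) is below both end points
    r = (5 ** 0.5 - 1) / 2             # golden ratio conjugate
    x1, x2 = hi - r * (hi - lo), lo + r * (hi - lo)
    f1, f2 = phi(x1), phi(x2)
    while hi - lo > tol:
        if f1 < f2:                    # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - r * (hi - lo)
            f1 = phi(x1)
        else:                          # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + r * (hi - lo)
            f2 = phi(x2)
    return 0.5 * (lo + hi)

# Hypothetical one-dimensional profile along a search direction.
lam_min = line_search(lambda lam: (lam - 2.0) ** 2 + 1.0)
print(lam_min)
```

Each call to `phi` corresponds to one evaluation of the error over the training set, which is why exact line search dominates the cost of a CG iteration.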
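The selection of $\beta(t)$ can be one of the following:

$$\beta(t) = \frac{\mathbf{g}^T(t+1)\mathbf{z}(t)}{\mathbf{d}^T(t)\mathbf{z}(t)} \quad \text{(Hestenes-Stiefel [35])}, \tag{5.45}$$

$$\beta(t) = \frac{\mathbf{g}^T(t+1)\mathbf{g}(t+1)}{\mathbf{g}^T(t)\mathbf{g}(t)} \quad \text{(Fletcher-Reeves [26])}, \tag{5.46}$$

$$\beta(t) = \frac{\mathbf{g}^T(t+1)\mathbf{z}(t)}{\mathbf{g}^T(t)\mathbf{g}(t)} \quad \text{(Polak-Ribiere [67])}, \tag{5.47}$$

$$\beta(t) = \frac{\mathbf{g}^T(t+1)\mathbf{g}(t+1)}{\mathbf{d}^T(t)\mathbf{z}(t)} \quad \text{(Dai-Yuan)}, \tag{5.48}$$

$$\beta(t) = -\frac{\mathbf{g}^T(t+1)\mathbf{g}(t+1)}{\mathbf{g}^T(t)\mathbf{d}(t)} \quad \text{(conjugate descent)}, \tag{5.49}$$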


where $\mathbf{z}(t)$ is defined as in (5.35). In the implementation,
$\mathbf{w}(0)$ is set as a random vector, and we set $\mathbf{d}(0) = -\mathbf{g}(0)$.
When $\|\mathbf{g}(t)\|$ is small enough, we terminate the process. The
computational complexity of the CG method is $O(N_w)$.

When the objective function is strictly convex quadratic and an exact line
search is applied, $\beta(t)$ is identical for all five choices, and termination
occurs in at most $N_w$ steps [25, 58]. With periodic restarting, all the above
nonlinear CG algorithms are well known to be globally convergent [58]. A globally
convergent algorithm is an iterative algorithm that converges to a local minimum
from almost any starting point. Polak-Ribiere CG with Powell’s restart
strategy [68] is considered to be one of the most efficient methods [86, 58].
Polak-Ribiere CG with restarts forces $\beta = 0$ whenever $\beta < 0$. This is
equivalent to forgetting the last search direction and restarting it from the
direction of steepest descent.
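The finite-termination property on a quadratic can be illustrated with a short sketch, assuming NumPy. The names `BETAS` and `cg_quadratic`, the test matrix, and the closed-form exact step (valid only for a quadratic cost) are hypothetical; only the Hestenes-Stiefel, Fletcher-Reeves, and Polak-Ribiere rules are included.

```python
import numpy as np

# beta rules as functions of (g_new, g_old, d, z), with z = g_new - g_old
BETAS = {
    "HS": lambda g1, g0, d, z: (g1 @ z) / (d @ z),
    "FR": lambda g1, g0, d, z: (g1 @ g1) / (g0 @ g0),
    "PR": lambda g1, g0, d, z: (g1 @ z) / (g0 @ g0),
}

def cg_quadratic(A, b, rule, n_steps):
    """CG on E(w) = 0.5 w^T A w - b^T w with exact line search."""
    w = np.zeros_like(b)
    g = A @ w - b
    d = -g                                   # d(0) = -g(0)
    betas = []
    for _ in range(n_steps):
        lam = -(g @ d) / (d @ A @ d)         # exact step for a quadratic
        w = w + lam * d
        g_new = A @ w - b
        beta = BETAS[rule](g_new, g, d, g_new - g)
        betas.append(beta)
        d = -g_new + beta * d
        g = g_new
    return w, betas

# Hypothetical symmetric positive-definite test problem with N_w = 4.
A = np.array([[4.0, 1.0, 0.0, 0.0],
              [1.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
results = {rule: cg_quadratic(A, b, rule, n_steps=4) for rule in BETAS}
for rule, (w, betas) in results.items():
    print(rule, np.linalg.norm(A @ w - b), betas[:3])
```

The last value of beta in each run is computed from a gradient that is already zero up to rounding, so only the first three values are meaningful for comparison.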

Powell proposed a popular restarting procedure [68]. It tests if there is
very little orthogonality between the current gradient and the previous one. If
$|\mathbf{g}^T(t)\mathbf{g}(t-1)| > 0.2\|\mathbf{g}(t)\|^2$ is satisfied, the CG search
direction is restarted by the steepest descent direction $-\mathbf{g}(t)$.

In [44], Perry’s CG method [65] is used for MLP training. In addition, self- 
scaled CG is derived from the principles of the Hestenes-Stiefel, Fletcher-Reeves, 
Polak-Ribiere and Perry’s methods. This class is based on the spectral scaling 
parameter. An efficient line-search technique is incorporated into the CG algo- 
rithms based on the Wolfe conditions and on safeguarded cubic interpolation. 
Finally, an efficient restarting procedure is employed in order to further improve 
the effectiveness of the CG algorithms. 

Empirically, the local minimum achieved with BP will, in general, be a solution
that is good enough for most purposes. In contrast, the CG method is easily
trapped at a bad local minimum, since the CG method moves towards the
bottom of whatever valley it reaches. Escaping a local minimum requires an
increase in $E$, and this is excluded by the line-search procedure. Consequently,
in such a case the convergence condition can never be reached [40]. The CG method
is usually applied several times with different random $\mathbf{w}(0)$ for the
minimum error [40].

The CG method can be regarded as an extension of BP with momentum that
automatically selects an appropriate learning rate $\eta(t)$ and momentum factor
$\alpha(t)$ in each epoch [18, 86, 97, 12]. Specifically, the CG algorithm can be
considered as BP with momentum with adjustable $\eta(t)$ and $\alpha(t)$, where
$\eta(t) = \lambda(t)$ and $\alpha(t) = \lambda(t)\beta(t)$ [18]. With an adaptive
selection of both $\eta(t)$ and $\alpha(t)$ for a quadratic error function,
referred to as optimally tuned, BP with momentum is proved to be exactly
equivalent to the CG method [12].

In [51], MLP is first decomposed into a set of adalines, each having its own 
local MSE function. The desired local output at each adaline is estimated based 
on error backpropagation. Each local MSE function has a unique optimum, which 
can be found within finite steps by using the CG method. By using a modified 
CG that avoids the line search, the local training method achieves a signifi- 
cant reduction in the number of iterations and the computation time. Given the 
approximation accuracy, the local method requires a computation time that is 
typically an order of magnitude less than that of the CG-based global method. 
The local method is particularly suited to parallel implementation. 


Example 5.1: Iris classification is revisited. Eighty per cent of the data set
is used as training data, and the remaining 20% as testing data. We set the
performance goal as 0.001, and the maximum number of epochs as 1000. We
simulate and compare eight popular MLP learning algorithms, namely, the RProp,
BFGS, one-step secant, LM, scaled CG, CG with Powell-Beale restarts,
Fletcher-Powell CG, and Polak-Ribiere CG algorithms.

Table 5.1. Performance comparison of a 4-4-3 MLP trained with eight learning algorithms.

Algorithm  Mean     Training  Classification  std     Mean training  std (s)
           epochs   MSE       accuracy (%)            time (s)
RP         985.92   0.0245    96.27           0.0408   9.0727        0.5607
BFGS       146.46   0.0156    96.67           0        2.8061        1.6105
OSS        999.16   0.0267    96.27           0.0354  16.1422        0.4315
LM         328.80   0.0059    93.33           0        5.7625        6.9511
SCG        912.44   0.0144    96.13           0.0371  13.4404        2.9071
CGB        463.54   0.0244    95.07           0.0468   8.4726        5.9076
CGF        574.26   0.0202    95.80           0.0361  10.5986        5.5025
CGP        520.34   0.0216    96.40           0.0349   9.5469        6.0092

RP: Rprop, OSS: one-step secant, SCG: scaled CG, CGB: CG with Powell-Beale restarts,
CGF: Fletcher-Powell CG, and CGP: Polak-Ribiere CG.

We select a 4-4-3 MLP network. At the training stage, for class i, the ith 
output node has an output 1, and the other two output nodes have value —1. 
We use the logistic sigmoidal function in the hidden layer and the linear function 
in the output layer. At the generalization stage, only the output node with the 
largest output is treated as 1 and outputs at the other nodes are treated as —1. 
The training results for 50 independent runs are listed in Table 5.1. The learning 
curves for a random run of these algorithms are shown in Fig. 5.2. For this 
example, we see that BFGS and LM usually achieve a lower MSE in less time, at
the cost of more memory. All the algorithms achieve good MSE and
classification performance.


Example 5.2: Character recognition is a classical problem in pattern recogni- 
tion. A network is to be designed and trained to recognize the 26 capital letters 
of the alphabet. An imaging system that digitizes each letter centered in the sys- 
tem’s field of vision is available. The result is that each letter is represented as a 
5 × 7 grid of Boolean values, or a 35-element input vector. The input consists of
26 arrays of 5 × 7 black or white pixels, each vectorized as a linear array of
zeros and ones. For example, character ’A’ (shown in Fig. 5.3) is represented
as 00100 01010 01010 10001 11111 10001 10001, or by the value of its ASCII code,
$65_{10}$. This is a classification problem of 26 output classes. Each target vector is
a 26-element vector with a 1 in the position of the letter it represents, and 0 
everywhere else. For example, the letter ’A’ is to be represented by a 1 in the 
first element (as ’A’ is the first letter of the alphabet), and 0 in the remaining 
25 elements. 


Figure 5.2 Iris classification using a 4-4-3 MLP trained with ten learning methods: the traces of the 
training error for a random run. t corresponds to the number of epochs. 
Figure 5.3 The 5 × 7 dot-matrix image of all the 26 characters.





We train the network using BFGS. Training stops when the performance goal of
0.01 is reached or after a maximum of 100 epochs. We use 40 hidden nodes. The
logistic sigmoidal function is selected for the hidden neurons and a linear
function is used for the output neurons. We implement two schemes. In the first
scheme, we train the network using the desired examples, and we then test its
generalization using noisy samples. In the second scheme, the network is first
trained by using the desired examples, then trained by using noisy examples of
4 different levels with noise of mean 0 and standard deviation of 0.3 or less,
and finally trained by the desired samples again. The classification errors for
the two schemes are shown in Fig. 5.4. From the result, the network trained with
noisy samples has a much better generalization performance.


Figure 5.4 Samples of normalized digits from the testing set.


Extended Kalman filtering methods


The EKF method belongs to second-order methods. EKF is an optimum filtering
method for a linear system resulting from the linearization of a nonlinear
system. It attempts to estimate the state of a system that can be modeled as
a linear system driven by an additive white Gaussian noise. It always estimates
the optimum step size, namely, the Kalman gain, for weight updating and thus
the rate of convergence is significantly increased when compared to that of the
gradient-descent method. The EKF approach is an incremental training method.
It is a general method for training any feedforward network. The RLS method
is a reduced form of the EKF method, and is more widely used in adaptation.

When using EKF, training of an MLP is viewed as a parametric identification
problem of a nonlinear system, where the weights are unknown and have to be
estimated for the given set of training patterns. The weight vector $\mathbf{w}$
is now a state vector, and an operator $f(\cdot)$ is defined to perform the
function of an MLP that maps the state vector and the input onto the output. The
training problem can be posed as a state estimation problem with the following
dynamic and observation equations

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \mathbf{v}(t), \tag{5.50}$$

$$\mathbf{y}_t = f(\mathbf{w}(t), \mathbf{x}_t) + \boldsymbol{\epsilon}(t), \tag{5.51}$$

where $\mathbf{x}_t$ is the input to the network at time $t$, $\mathbf{y}_t$ is the
observed (or desired) output of the network, and $\mathbf{v}$ and
$\boldsymbol{\epsilon}$ are the process and measurement noise, assumed white and
Gaussian with zero mean and covariance matrices $\mathbf{Q}(t)$ and
$\mathbf{R}(t)$, respectively. The state estimation problem is then to determine
$\hat{\mathbf{w}}$, the estimated weight vector, that minimizes the sum of squared
prediction errors of all prior observations.

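EKF is a minimum variance estimator based on the Taylor-series expansion of
$f(\mathbf{w})$ in the vicinity of the previous estimate. The EKF method for
estimating $\hat{\mathbf{w}}(t)$ is given by [74, 38, 84]

$$\hat{\mathbf{w}}(t+1) = \hat{\mathbf{w}}(t) + \mathbf{K}(t+1)\left[\mathbf{y}_{t+1} - \hat{\mathbf{y}}_{t+1}\right], \tag{5.52}$$

$$\boldsymbol{\Xi}(t+1) = \mathbf{F}(t+1)\mathbf{P}(t)\mathbf{F}^T(t+1) + \mathbf{R}(t+1), \tag{5.53}$$

$$\mathbf{K}(t+1) = \mathbf{P}(t)\mathbf{F}^T(t+1)\boldsymbol{\Xi}^{-1}(t+1), \tag{5.54}$$

$$\mathbf{P}(t+1) = \mathbf{P}(t) - \mathbf{K}(t+1)\mathbf{F}(t+1)\mathbf{P}(t) + \mathbf{Q}(t), \tag{5.55}$$

where $\mathbf{K}$ is the $N_w \times N_y$ Kalman gain, $N_y$ is the dimension of
$\mathbf{y}$, $\mathbf{P}$ is the $N_w \times N_w$ conditional error covariance
matrix, $\hat{\mathbf{y}}$ is the estimated output, and

$$\mathbf{F}(t+1) = \left.\frac{\partial f}{\partial \mathbf{w}}\right|_{\mathbf{w}=\hat{\mathbf{w}}(t)} = \left.\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{w}}\right|_{\mathbf{w}=\hat{\mathbf{w}}(t)} \tag{5.56}$$

is an $N_y \times N_w$ matrix.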

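In the absence of a priori information, the initial state vector can be set
randomly and $\mathbf{P}$ can be set as a diagonal matrix:
$\mathbf{P}(0) = \frac{1}{\varepsilon}\mathbf{I}$,
$\hat{\mathbf{w}}(0) \sim \mathcal{N}(\mathbf{0}, \mathbf{P}(0))$, where $\varepsilon$
is a small positive number and $\mathcal{N}(\mathbf{0}, \mathbf{P}(0))$ denotes a
zero-mean Gaussian distribution with covariance matrix $\mathbf{P}(0)$. A relation
between the Hessian $\mathbf{H}$ and $\mathbf{P}$ is established in [84].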
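A single EKF weight update following (5.52) to (5.55) can be sketched as below, assuming NumPy. Everything beyond the four update equations is a hypothetical illustration: the function name `ekf_step`, the single-neuron model with a tanh output, the zero initial weights (chosen for reproducibility instead of a random draw), and the values of the noise covariances.

```python
import numpy as np

def ekf_step(w, P, x, y, f, jac, R, Q):
    """One EKF weight update; F is the N_y x N_w Jacobian of the network
    output with respect to the weights at the current estimate."""
    F = jac(w, x)
    Xi = F @ P @ F.T + R                   # innovation covariance
    K = P @ F.T @ np.linalg.inv(Xi)        # Kalman gain, N_w x N_y
    w_new = w + K @ (y - f(w, x))          # correct weights by the innovation
    P_new = P - K @ F @ P + Q              # covariance update
    return w_new, P_new

# Hypothetical single-neuron "network" with scalar output y = tanh(w^T x).
f = lambda w, x: np.array([np.tanh(w @ x)])
jac = lambda w, x: ((1.0 - np.tanh(w @ x) ** 2) * x)[None, :]

rng = np.random.default_rng(0)
w_true = np.array([0.8, -0.5])
X = rng.normal(size=(200, 2))
Y = np.tanh(X @ w_true)

w = np.zeros(2)                            # illustrative deterministic start
P = 100.0 * np.eye(2)                      # large initial covariance
R = 0.01 * np.eye(1)                       # assumed measurement noise covariance
Q = np.zeros((2, 2))                       # no process noise in this sketch
mse_before = np.mean((Y - np.tanh(X @ w)) ** 2)
for x_t, y_t in zip(X, Y):                 # incremental (pattern-by-pattern) training
    w, P = ekf_step(w, P, x_t, np.array([y_t]), f, jac, R, Q)
mse_after = np.mean((Y - np.tanh(X @ w)) ** 2)
print(mse_before, mse_after, w)
```

For a full MLP the Jacobian would be obtained by backpropagating each output separately, and the matrix `P` then has one row and column per weight, which is the source of the high per-cycle cost.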

The method given above is the global EKF method for MLP training [82, 38]. 
Compared to BP, it needs far fewer training cycles to reach convergence and the 
quality of the solution is better, at the expense of a much higher computational 
complexity at each cycle. BP is proved to be a degenerate form of EKF [74]. The 
fading-memory EKF and a UD factorization based fading-memory EKF, which 
use an adaptive forgetting factor, are two fast algorithms for learning feedforward 
networks [99]. The EKF variant reported in [72] performs as efficiently as the 
LM method. 

In order to reduce the complexity of the EKF method, one can partition the 
global problem into many small-scaled, separate, localized identification subprob- 
lems for each neuron in the network so as to solve the individual subproblems. 
Examples of the localized EKF methods are the multiple extended Kalman algo- 
rithm [80] and the decoupled EKF algorithm [69]. The localized algorithms set to 
zero the off-diagonal terms of the covariance matrix in the global EKF method; 
this, however, ignores the natural coupling of the weights. 

The Kalman filtering method is based on the assumption of the noise being
Gaussian, and is thus sensitive to noises of other distributions. According to
the $H^\infty$ theory, the maximum energy gain that the Kalman filtering algorithm
contributes to the estimation error due to disturbances has no upper bound [61].
The extended $H^\infty$ filtering method, in the form of global and local
algorithms, can be treated as an extension to EKF for enhancing the robustness
to disturbances. The computational complexity of the extended $H^\infty$ filtering
method is typically twice that of EKF.
Recursive least squares


When R(t + 1) in (5.54) and Q(t) in (5.55), respectively, reduce to the identity 
matrix I and zero matrix O of the same size, the EKF method is reduced to 
the RLS method. The RLS method is applied for learning of layered feedforward 
networks in [82, 4, 13, 48]. It is typically an order of magnitude faster than LMS, 
which is equivalent to BP in one-layer networks. For a given accuracy, RLS is 
shown to require tenfold fewer epochs than BP [13]. 

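The RLS method is derived from the optimization of the energy function [48]

$$E(\mathbf{w}(t)) = \sum_{p=1}^{t} \left\|\mathbf{y}_p - f(\mathbf{w}(t), \mathbf{x}_p)\right\|^2 + \left[\mathbf{w}(t) - \hat{\mathbf{w}}(0)\right]^T \mathbf{P}^{-1}(0) \left[\mathbf{w}(t) - \hat{\mathbf{w}}(0)\right]. \tag{5.57}$$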


When $\mathbf{P}(0) = \frac{1}{\varepsilon}\mathbf{I}$ and $\hat{\mathbf{w}}(0)$ is a
small vector, the second term in (5.57) reduces to
$\varepsilon \mathbf{w}^T(t)\mathbf{w}(t)$. Thus, the RLS method is implicitly a
weight-decay technique whose weight-decay effect is governed by $\mathbf{P}(0)$.
Smaller $\varepsilon$ usually leads to better training accuracy, while larger
$\varepsilon$ results in better generalization [48]. At iteration $t$, the Hessian
for the above error function is related to the error covariance matrix
$\mathbf{P}(t)$ by [48]

$$\mathbf{H}(t) \approx 2\mathbf{P}^{-1}(t) - 2\mathbf{P}^{-1}(0). \tag{5.58}$$

The RLS method has a computational complexity of $O(N_w^2)$.
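The implicit weight-decay effect can be checked on a linear model, for which RLS is exact. In the sketch below (NumPy assumed; the function name `rls_step`, the synthetic data, and the value of `eps` are hypothetical), the recursion is started from a zero weight vector with the initial covariance set to the identity divided by `eps`, and the result is compared with the batch ridge-regression solution whose penalty is `eps` times the squared weight norm.

```python
import numpy as np

def rls_step(w, P, x, y):
    """One RLS update for a scalar-output linear model y = w^T x
    (unit measurement covariance, no process noise)."""
    Px = P @ x
    k = Px / (1.0 + x @ Px)            # gain vector
    w_new = w + k * (y - w @ x)        # correct by the prediction error
    P_new = P - np.outer(k, Px)        # covariance update (P symmetric)
    return w_new, P_new

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.05 * rng.normal(size=50)

eps = 0.01
w = np.zeros(3)                        # zero initial weight vector
P = np.eye(3) / eps                    # large initial covariance
for x_t, y_t in zip(X, y):
    w, P = rls_step(w, P, x_t, y_t)

# Batch minimizer of sum of squared errors + eps * ||w||^2.
w_ridge = np.linalg.solve(X.T @ X + eps * np.eye(3), X.T @ y)
print(w, w_ridge)
```

The final `P` equals the inverse of the regularized normal-equation matrix, which is the linear-model counterpart of the relation between the Hessian and the error covariance matrix.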

A complex training problem can also be decomposed into separate, localized 
identification subproblems, each being solved by the RLS method [83, 63]. In 
the local linearized LS method [83], each subproblem has the objective function 
as the sum of the squares of the linearized backpropagated error signals for 
each neuron. In the block RLS algorithm [63], at a step of the algorithm, an 
M-layer feedforward network is divided into M — 1 subproblems, each being an 
overdetermined system of linear equations for each layer of the network. This is 
a considerable saving with respect to a global method. 

There is no explicit decay in the energy function in the RLS algorithm and 
the decay effect diminishes linearly with the number of training epochs [48]. In 
order to speed up the learning process as well as to improve the generalization 
ability of the trained network, a true weight-decay RLS algorithm [49] combines 
a regularization term for quadratic weight decay. The generalized RLS model [93] 
includes a general decay term in the energy function; it can yield a significantly 
improved generalization ability of the trained networks and a more compact 
network, with the same computational complexity as that of the RLS algorithm. 
Natural-gradient descent method


When the parametric space is not Euclidean but has a Riemannian metric
structure, the ordinary gradient does not give the steepest-descent direction of
the cost function, while the natural gradient does [1]. Natural-gradient descent
exploits the natural Riemannian metric that the Fisher information matrix defines
in the MLP weight space.

Natural-gradient descent replaces standard gradient descent by

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \left[\mathbf{G}(\mathbf{w}_t)\right]^{-1} \nabla e(\mathbf{x}_t, y_t; \mathbf{w}_t), \tag{5.59}$$

where $e(\mathbf{x}, y; \mathbf{w}) = (f(\mathbf{x}; \mathbf{w}) - y)^2$ is the local
square error and $\mathbf{G}(\mathbf{w})$ is the metric tensor when the MLP weight
space is viewed as an appropriate Riemannian manifold. $\mathbf{G}$ coincides with
Levenberg’s approximation to the Hessian of a square error function [34]. The
most natural way to arrive at $\mathbf{G}$ is to recast MLP training as a
log-likelihood maximization problem [1].
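The effect of premultiplying the gradient by the inverse metric can be seen on a linear model with badly scaled inputs. The sketch below is a batch variant under stated assumptions, not the online rule itself: NumPy is assumed, the model is linear, the metric `G` is estimated as the sample average of the outer products of the inputs (the Fisher matrix for a unit-variance Gaussian noise model), and all names and data are hypothetical. With the local error taken as the squared residual, a single natural-gradient step of size 0.5 lands on the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data with badly scaled input components.
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 0.1])
w_true = np.array([0.5, -0.2, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=200)

def mean_grad(w):
    """Average gradient of the local square error (f(x; w) - y)^2."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

G = X.T @ X / len(y)                    # assumed estimate of the metric tensor

def natural_gradient_step(w, eta):
    # Solve G u = grad instead of forming the inverse of G explicitly.
    return w - eta * np.linalg.solve(G, mean_grad(w))

w0 = np.zeros(3)
w_nat = natural_gradient_step(w0, eta=0.5)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_nat, w_ls)
```

An ordinary gradient step with any single learning rate would be too large for the strongly scaled component or too small for the weakly scaled one, which is the ill-conditioning that the metric removes.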

The online natural-gradient learning gives the Fisher efficient estimator [1], 
implying that it is asymptotically equivalent to the optimal batch procedure. 
This suggests that the flat-spot problem that appears in BP disappears when 
natural gradient is used [1]. According to the Cramer-Rao bounds, Fisher effi- 
ciency is the best asymptotic performance that any unbiased learning algorithm 
can achieve. However, calculation of the Fisher information matrix and its inverse 
is practically very difficult. An adaptive method of directly obtaining the inverse 
of the Fisher information matrix is proposed in [2]. It generalizes the adaptive 
Gauss-Newton algorithms, and the natural-gradient method is equivalent to the 
Newton method around the optimal point. In the batch natural-gradient descent method,
the Fisher matrix essentially coincides with the Gauss-Newton approximation 
of the Hessian of the MLP MSE function and the natural-gradient method is 
closely related to the LM method [34, 37]. Natural-gradient descent should have 
a linear convergence in a Riemannian weight space compared to the superlin- 
ear one of the LM method in the Euclidean weight space. A natural-conjugate 
gradient method for MLP training is discussed in [29]. 

In [34], natural gradients are derived in a slightly different manner and
batch-mode learning and pruning are linked to existing algorithms such as LM
optimization and OBS. The Rprop algorithm with the natural gradient [37]
converges significantly faster than Rprop. It performs at least as well as the
LM method and appears to be slightly more robust. In contrast to Rprop, a weight
update in LM optimization and in Rprop with the natural gradient requires cubic
time and quadratic memory, and both methods have additional hyperparameters that
are difficult to adjust [37].
Other learning algorithms


The LP method is also used for training feedforward networks [81]. The LP 
method can be effective for training a small network and can converge rapidly 
and reliably to a better solution than BP. However, it may take too long a time 
for each iteration for very large networks. Some measures are considered in [81] 
so as to extend the method for efficient implementations in large networks. 


Layerwise linear learning 


Feedforward networks can be trained by iterative layerwise learning methods [5,
78, 23, 73]. Weight updating is performed layer by layer, and weight optimization
at each layer is reduced to solving a set of linear equations,
$\mathbf{A}\mathbf{w}^{(m)} = \mathbf{b}$, where $\mathbf{w}^{(m)}$ is a weight vector
associated with the layer, and $\mathbf{A}$ and $\mathbf{b}$ are a matrix and
a vector of suitable dimensions, respectively. These algorithms are typically one
to two orders of magnitude faster in computational time than BP for a given
accuracy.
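For a linear output layer, the set of linear equations can be taken as the normal equations of a least-squares problem. The sketch below, assuming NumPy, shows only this one layer-level solve with the hidden layer held fixed. The variable names, the random hidden weights, and the synthetic targets are hypothetical, and the cited methods differ in how the matrix and right-hand side are formed for hidden layers.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))                       # 60 patterns, 4 inputs
W_hidden = rng.normal(size=(4, 5))                 # fixed hidden weights (hypothetical)
# Hidden-layer outputs with a constant column appended for the output bias.
Hid = np.hstack([logistic(X @ W_hidden), np.ones((60, 1))])

# Synthetic targets generated by known output weights, for checking only.
w_out_true = np.array([0.3, -1.0, 0.7, 0.2, -0.4, 0.1])
t = Hid @ w_out_true

# Layer-level linear system A w = b (normal equations of the LS problem).
A = Hid.T @ Hid
b = Hid.T @ t
w_out = np.linalg.solve(A, b)
print(w_out)
```

Because the solve is direct, the output weights are optimal for the current hidden layer after one step, whereas gradient descent would need many iterations to reach the same point.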

BP can be combined with a linear-algebra method [5] or Kalman filtering 
method [78]. In [5], sets of linear equations are formed based on the computation 
of target node values using inverse activation functions. The updated weights 
need to be transformed to ensure that target values are in the range of the 
activation functions. An efficient method that combines the layerwise approach 
and the BP strategy is given in [73]. This layerwise BP algorithm is more accurate 
and faster than the CG with Powell restarts and the Quickprop. 

In [19, 98], a fast algorithm for three-layer MLP, called OWO-HWO, a com- 
bination of hidden weight optimization (HWO) and output weight optimization 
(OWO) has been described. HWO is a batch version of the Kalman filtering 
method given in [78], restricted to hidden units. OWO solves a set of linear equa- 
tions to optimize the output weights. OWO-HWO is equivalent to a combination 
of linearly transforming the training data and performing OWO-BP [52], which 
uses OWO to update the output weights and BP to update the hidden weights. 
OWO-HWO is superior to OWO-BP in terms of convergence, and converges to 
the same training error as the LM method does in an order of magnitude less 
time [19, 98]. 

The parameterwise algorithm for MLP training [50] is based on the idea
of the layerwise algorithm. It does not need to calculate the gradient of the error
function. In each iteration, the weights or thresholds can be optimized directly 
one by one with the other variables fixed. The error function is simplified greatly 
by means of only calculating the changed part of the error function in the training 
process. In comparisons with BP-with-momentum and the layerwise algorithms, 
the parameterwise algorithm achieves more than an order of magnitude faster 
convergence. 
Escaping local minima


Conventional first-order and second-order gradient based methods cannot avoid 
local minima. The error surface of an MLP has a stair-step appearance with 
many very flat and very steep regions [36]. For the case of a small number 
of training examples, there is often a one-to-one correspondence between the 
individual training examples and the steps on the surface. The surface becomes 
smoother as the number of training examples is increased. In all directions there 
are flat regions extending to infinity, which makes line-search-related learning 
algorithms useless. 

Many strategies have been explored to reduce the chances of getting trapped 
at a local minimum. One simple and effective technique to avoid local minima 
in incremental learning is to present examples to the network in a random order 
from the training set during each epoch. Another way is to run the learning 
algorithms using initial values in different regions of the weight space, and then 
to find the best solution. This is especially useful for fast convergent algorithms 
such as the CG algorithm. 

The injection of noise into the learning process is an effective means for escap- 
ing from local minima. This also leads to a better generalization capability. Var- 
ious annealing schemes actually use this strategy. Random noise can be added 
to the input, to the desired output, or to the weights. The level of the added 
noise should be decreased as learning progresses. The three methods have the 
same effect, namely, the inclusion of an extra stochastic term in the weight vec- 
tor adaptation. A random step-size strategy implemented in [87] employs an 
annealing average step size. The large steps enable the algorithm to jump over 
local maxima/minima, while the small ones ensure convergence in a local area. 
An effective way to escape from local minima is realized by incorporating an 
annealing noise term into the gradient-descent algorithm [17]. This heuristic has 
also been used in SARprop. 

Weight scaling [27, 71] is a technique used for escaping local minima and
accelerating convergence. Using the weight scaling process, the weight vector to
each neuron $\mathbf{w}_i^{(m)}$ is scaled by a factor $a$, where $a \in (0, 1)$ is
decided by a relation of the degree of saturation at each node
$\rho_i^{(m)} = |o_i^{(m)} - 0.5|$ (with $o_i^{(m)}$ being the output of the node),
the learning rate, and the maximum error
at the output nodes. Weight scaling effectively reduces the degree of saturation
of the activation function and thus maintains a relatively large derivative of 
the activation function. This enables relatively large weight updates, which may 
eventually lead the training algorithm out of a local minimum. 
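The effect of scaling on a saturated logistic neuron can be illustrated numerically. In this sketch (NumPy assumed) the scale factor `a` is simply fixed at a hypothetical value, since the relation that determines it is not reproduced here; the weights and input are likewise made up.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def saturation_and_slope(w, x):
    """Degree of saturation |o - 0.5| and activation derivative o(1 - o)."""
    o = logistic(w @ x)
    return abs(o - 0.5), o * (1.0 - o)

w = np.array([6.0, -4.0])      # hypothetical weights driving the node into saturation
x = np.array([1.0, -0.5])
a = 0.5                        # hypothetical scale factor in (0, 1)

rho_before, slope_before = saturation_and_slope(w, x)
rho_after, slope_after = saturation_and_slope(a * w, x)
print(rho_before, slope_before)
print(rho_after, slope_after)
```

Since the backpropagated weight update is proportional to the activation derivative, the larger slope after scaling translates directly into a larger update for this node.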

The natural way to implement Newton’s methods is to confine a quadratic
approximation of the objective function $E(\mathbf{w})$ to a trust region. The
trust-region subproblem is then solved to obtain the next iterate. The
attractor-based trust-region method is an alternating two-phase algorithm for MLP
learning [46]. The first phase is a trust-region based local search for fast
training of the network and global convergence, while the second phase is an
attractor-based global search for escaping local minima utilizing a quotient
gradient system. The trust-region subproblem is solved by applying Powell’s
dogleg trust-region method [25]. The algorithm outperforms BP with momentum, BP
with tunneling, and LM algorithms [46].

Stochastic learning algorithms typically converge slowly when compared
to BP, but they can generalize better and effectively avoid local minima. Besides,
they are flexible in network topology, error function, and activation function.
Gradient information is not required. There are also many heuristic global
optimization techniques, such as evolutionary algorithms, for MLP learning.


Complex-valued MLPs and their learning 


In the real domain, common nonlinear transfer functions are the hyperbolic tan- 
gent and logistic functions, which are bounded and analytic everywhere. Accord- 
ing to Liouville’s theorem, a complex transfer function, which is both bounded 
and analytic everywhere, has to be a constant. As a result, designing a neu- 
ral network for processing complex-valued signals is a challenging task, since 
a complex nonlinear activation function cannot be both analytic and bounded 
everywhere in the complex plane $\mathbb{C}$. The Cauchy-Riemann equations are necessary
and sufficient conditions for a complex function to be analytic at a point
$z \in \mathbb{C}$. Complex-valued neural networks are useful for processing complex-valued
data, such as equalization and modeling of nonlinear channels. Digital channel 
equalization can be treated as a classification problem. 
The error function $E$ for training complex MLPs is defined by

$$E = \frac{1}{2} \sum_{p=1}^{N} \mathbf{e}_p^H \mathbf{e}_p, \tag{5.60}$$

where $\mathbf{e}_p$ is defined by (4.6), but it is a complex-valued vector. $E$ is not analytic
since it is a real-valued function.


Split complex BP 


The conventional approach for learning complex-valued MLPs selects split com- 
plex activation functions. Each split complex function consists of a pair of real 
sigmoidal functions marginally processing the inphase and quadrature compo- 
nents. Based on this split strategy, the complex version of BP is derived for 
complex-valued MLPs [47, 11, 62, 85]. This approach can avoid the unbounded- 
ness of fully complex activation functions. However, the split complex activation 
function cannot be analytic. 

The split complex BP uses the split derivatives of the real and imaginary 
components instead of relying on well-defined fully complex derivatives. The 
derivatives cannot fully exploit the correlation between the real and imaginary
components of the weighted sum of the input vectors. In the split approach, the 
activation function is split by 


$$\phi(z) = \phi_R(\Re(z)) + j\,\phi_I(\Im(z)), \tag{5.61}$$

where $z$ is the net input to a neuron, $z = \mathbf{w}^T \mathbf{x}$, and $\mathbf{x}, \mathbf{w} \in \mathbb{C}^J$ are the
$J$-dimensional complex input and weight vectors, respectively. Typically, $\phi_R(\cdot)$
and $\phi_I(\cdot)$ are selected to be the same sigmoidal function; for example, one can
select $\phi_R(x) = \phi_I(x) = x + a_0 \sin(\pi x)$, where the constant $a_0 \in (0, 1/\pi)$ [96].
This split complex function satisfies most properties of complex activation
functions. This method can reduce the information redundancy among hidden
neurons of a complex MLP, and results in a guaranteed weight update when the
estimation error is not zero.
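
As a minimal sketch of the split activation (5.61), the following Python code applies $\phi_R(x) = \phi_I(x) = x + a_0 \sin(\pi x)$ separately to the real and imaginary parts of the net input $z = \mathbf{w}^T \mathbf{x}$. The value $a_0 = 0.2$, the function names, and the sample weights and inputs are assumptions chosen only for illustration.

```python
import math

A0 = 0.2  # assumed value; any constant in (0, 1/pi) is admissible


def phi_real(x: float, a0: float = A0) -> float:
    """Real component function x + a0*sin(pi*x)."""
    return x + a0 * math.sin(math.pi * x)


def split_activation(z: complex, a0: float = A0) -> complex:
    """Apply the same real function separately to Re(z) and Im(z), as in (5.61)."""
    return complex(phi_real(z.real, a0), phi_real(z.imag, a0))


def net_input(w, x):
    """Net input z = w^T x (no conjugation) for complex sequences w and x."""
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    w = [0.5 + 0.1j, -0.3 + 0.4j]   # hypothetical weights
    x = [1.0 - 1.0j, 0.2 + 0.7j]    # hypothetical inputs
    z = net_input(w, x)
    print("net input:", z)
    print("activation:", split_activation(z))
```

Because $a_0 < 1/\pi$, the derivative $1 + a_0 \pi \cos(\pi x)$ stays positive, so each component function is monotonically increasing.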

The sensitivity of a split-complex MLP due to the errors of the inputs and 
the connection weights between neurons is statistically analyzed in [94]. The 
sensitivity is affected by the number of the layers and the number of the neurons 
adopted in each layer, and an efficient algorithm to estimate the sensitivity is 
developed. When an MLP is trained with split-complex BP, it has a relatively 
strong dependence of the performance on the initial values. For the effective 
adjustments of the weights and biases in split-complex BP, the range of the 
initial values should be greater than that of the adjustment quantities [95]. This 
criterion can reduce the misadjustment of the weights and biases. The estimated 
range of the initial values gives significantly improved performance. 


Fully complex BP 


Fully complex BP is derived based on a suitably selected complex activation 
function [28, 42]. In [28], an activation function is defined as

$$\phi(z) = \frac{z}{c_0 + \frac{1}{r_0}|z|}, \tag{5.62}$$

where $c_0$ and $r_0$ are real positive constants. The function $\phi(\cdot)$ maps a point $z$ on
the complex plane to a unique point $\phi(z)$ on the open disc $\{z : |z| < r_0\}$, with the
same phase angle, and $c_0$ controls the steepness of $|\phi(z)|$. This complex function
satisfies most of the properties for an activation function, and a circuit for such a
complex neuron is designed in [28].
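
The two properties stated for (5.62), a magnitude bounded by $r_0$ and an unchanged phase angle, can be checked numerically. The sketch below is illustrative only; the function name and the chosen constants are assumptions.

```python
import cmath


def disc_activation(z: complex, c0: float = 1.0, r0: float = 1.0) -> complex:
    """Evaluate phi(z) = z / (c0 + |z| / r0) for positive real c0 and r0."""
    return z / (c0 + abs(z) / r0)


if __name__ == "__main__":
    for z in (0.1 + 0.1j, 3.0 + 4.0j, -50.0 + 20.0j):
        out = disc_activation(z, c0=0.5, r0=2.0)
        # The magnitude stays below r0 and the phase difference is zero
        # up to rounding, because z is divided by a positive real number.
        print(z, abs(out), cmath.phase(out) - cmath.phase(z))
```

Since the denominator is a positive real number, the output is a positive real multiple of $z$, which is why the phase is preserved.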

In [42], fully complex BP [28] is simplified by using the Cauchy-Riemann equa- 
tions. It is shown that fully complex BP is the complex conjugate form of BP 
and that split complex BP is a special case of fully complex BP. This generaliza- 
tion is possible by employing elementary transcendental functions (ETFs) that 
are almost everywhere bounded and analytic in $\mathbb{C}$. Complex ETFs provide
well-defined derivatives for optimization of the fully complex BP algorithm. A
number of complex ETFs, including $\sin z$, $\tan z$, $\sinh z$ and $\tanh z$, are shown
to be suitable nonlinear activation functions in [42]. These ETFs provide a
parsimonious structure for processing data in the complex domain. Fully complex
MLPs with these ETFs as activation functions are proved to be universal
approximators in the complex domain [43].

Fully complex normalized BP [32] is an improvement on the complex BP [28] 
obtained by including an adaptive normalized learning rate. This is achieved 
by performing a minimization of the complex-valued instantaneous output error 
that has been expanded via a Taylor-series expansion. The method is valid for 
any complex activation function discussed in [42]. 

The minimization criteria used in the complex-valued BP learning algorithms
do not approximate the phase of complex-valued output well in function
approximation problems. The phase of a complex-valued output is critical in
telecommunications, and in reconstruction and source localization problems in
medical imaging applications. In [77], the convergence of complex-valued neural
networks is investigated using a systematic sensitivity study, and the performance
of different types of split complex-valued neural networks is compared. A
complex-valued BP algorithm with logarithmic performance index is proposed with
exponential activation function $f(z) = \exp(z)$; it directly minimizes both the
magnitude and phase errors and also provides better convergence characteristics.
The exponential function is entire since $f(z) = f'(z) = \exp(z)$ in $\mathbb{C}$. It has an
essential singularity at $\pm\infty$. By restricting the weights of the network to a ball
of small radius and the number of hidden neurons to a finite value, the bounded
behavior in fully complex-valued MLP can be achieved.

Many other algorithms for training complex-valued MLPs are typically complex
versions of algorithms used for training real-valued MLPs. Split-complex
EKF [70] has faster convergence than split complex BP [47]. Split-complex
RProp [41] outperforms split complex BP [96] and fully complex BP
[42] in terms of computational complexity, convergence speed, and accuracy.


Example 5.3: Handwritten digit recognition can be commercially used in 
postal services [45] or in banking services to recognize handwritten digits on 
envelopes or bank cheques. This is a real image-recognition problem. The input 
consists of black or white pixels. The mapping is from the two-dimensional image 
space into ten output classes. 

The MNIST data set contains handwritten digits. The training set is 9298 
segmented numerals digitized from handwritten postal codes that appeared on 
real U.S. mail. An additional 3349 printed digits coming from 35 different fonts are
also added to the training set. Around 79% of the training set is used for training 
and the remaining 21% for testing. The size of the characters is normalized 
to a 16 x 16 pixel image by using a linear transformation. Due to the linear 
transformation, the resulting image is not binary but has multiple gray levels. 
The gray-leveled image is further scaled to the range of —1 to 1. Some samples 
of normalized digits from the testing set are shown in Fig. 5.5. 

Figure 5.5 Samples of normalized digits from the testing set (From [45]).

A large six-layer MLP (784-576-576-768-192-10) trained with BP is designed
to recognize handwritten digits. Weight sharing and other heuristics are used
among the hidden layers. The input layer uses a 28 x 28 plane instead of a 16 x
16 plane to avoid problems arising from a kernel overlapping a boundary. The
network has 4635 nodes, 98442 connections, and 2578 independent parameters. 
This architecture was optimized by using the OBD technique. The large network 
with weight sharing was shown to outperform a small network with the same 
number of free parameters. 

An incremental BP algorithm based on an approximate Newton's method was
employed, and training ran for 30 epochs through the training set plus
test. The error rates on the training set and testing set were, respectively, 1.1%
and 3.4%. When a rejection criterion was applied to the testing set, the result
was a 9% rejection rate for 1% error. These rejected samples may be due to faulty
segmentation, or to writing that is ambiguous even to humans.


Example 5.4: The Mackey-Glass differential delay equation describes a time
series system with chaotic behavior:

$$\frac{dx(t)}{dt} = -0.1\,x(t) + \frac{0.2\,x(t-\tau)}{1 + x(t-\tau)^{10}}.$$

The training data are collected by integrating the equation using Euler's method
with $\tau = 17$ and a step size of 1. Assume $x(18) = 1.2$ and $x(t) = 0$ for
$t < 18$. The samples are shown in Fig. 5.6a.
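
The data generation described above can be sketched in a few lines of Python. This is only an illustration under the stated settings ($\tau = 17$, unit Euler step, $x(18) = 1.2$, $x(t) = 0$ for $t < 18$); the function names and the number of generated samples are assumptions, and the list index is taken to be the time $t$.

```python
def mackey_glass(n_samples: int, tau: int = 17, t0: int = 18, x0: float = 1.2):
    """Return [x(0), ..., x(n_samples - 1)] using Euler's method with step 1.

    x(t) = 0 for t < t0 and x(t0) = x0; requires n_samples > t0 >= tau.
    """
    x = [0.0] * n_samples
    x[t0] = x0
    for t in range(t0, n_samples - 1):
        delayed = x[t - tau]
        dxdt = -0.1 * x[t] + 0.2 * delayed / (1.0 + delayed ** 10)
        x[t + 1] = x[t] + dxdt  # unit step size
    return x


def make_patterns(x, start: int):
    """Build (input, target) pairs: (x(t), x(t-1), x(t-2)) -> x(t+1)."""
    return [((x[t], x[t - 1], x[t - 2]), x[t + 1])
            for t in range(start, len(x) - 1)]


if __name__ == "__main__":
    series = mackey_glass(1000)           # series length is an assumed choice
    patterns = make_patterns(series, 20)  # start once three past values exist
    print(len(patterns), patterns[0])
```

The pairs returned by `make_patterns` match the input/output arrangement used for the prediction task in this example; the training method itself is not shown.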

The task is to predict a future point based on currently available time-series 
points by using MLP with BFGS. Three time-series points x(t), x(t — 1), and 
x(t — 2) are considered as network input while the output is the sample x(t + 1). 
We use 3 nodes in the hidden layer. We use the first 500 samples for training, 
and the remaining samples for testing. The training reaches an accuracy of 0.01 
within 22 iterations, or 2 s. The prediction error is shown in Fig. 5.6. After
training with the first 500 samples, the prediction is sufficiently accurate for
the remaining samples.

Figure 5.6 Prediction of the Mackey-Glass time-series.


Problems 


5.1 Show that inclusion of a momentum term in the weight update in the BP 
algorithm can be considered as an approximation to the CG method. 


5.2 Learn the function $f(x) = x \sin x$, $0 \le x \le \pi$, using MLP with the BP and
BFGS algorithms. Generate random data sets for training and for testing.


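5.3 The matrix inversion lemma given below is usually used for the derivation of
the iterative algorithm:

$$(\mathbf{A} + \mathbf{B}\mathbf{C})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{C}\mathbf{A}^{-1},$$

where $\mathbf{I}$ is the identity matrix, and $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ are matrices of proper size.

(a) Prove the matrix inversion lemma. [Hint: premultiply or postmultiply both sides
by $\mathbf{A} + \mathbf{B}\mathbf{C}$.]

(b) For MLP, the Hessian matrix for a data set of $N$ patterns is given by a
diagonal approximation [33]

$$\mathbf{H}_N = \sum_{n=1}^{N} \mathbf{g}_n \mathbf{g}_n^T,$$

where $\mathbf{g}_n = \nabla_{\mathbf{w}} E_n$, $E_n$ being the sum-of-squared error for all $n$ patterns. This
expression can be written as

$$\mathbf{H}_{N+1} = \mathbf{H}_N + \mathbf{g}_{N+1} \mathbf{g}_{N+1}^T.$$

Using the matrix inversion lemma, show that the inverse Hessian is given by the
following iterative expression:

$$\mathbf{H}_{N+1}^{-1} = \mathbf{H}_N^{-1} - \frac{\mathbf{H}_N^{-1} \mathbf{g}_{N+1} \mathbf{g}_{N+1}^T \mathbf{H}_N^{-1}}{1 + \mathbf{g}_{N+1}^T \mathbf{H}_N^{-1} \mathbf{g}_{N+1}}.$$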

5.4 Train MLP to solve the character-recognition problem. The ASCII characters
are printed in 5 x 7 dot-matrix.

(a) Design an MLP network to map the dot-matrices of characters 'A'-'Z' and
'a'-'z' to their corresponding ASCII codes.

(b) Design an MLP network to map the dot-matrices of 10 digits to their corre- 
sponding ASCII codes. 

(c) Describe the network’s tolerance to noisy inputs after training is complete. 


5.5 For two classes and a single network output, the cross-entropy error function
is given by

$$E = -\sum_{n} \left[ t^{(n)} \ln y^{(n)} + \left(1 - t^{(n)}\right) \ln\left(1 - y^{(n)}\right) \right].$$

Derive the Hessian matrix. As training proceeds, the network output approximates
the conditional averages of the target data, and some terms in the matrix
components vanish.


5.6 The two spirals data set is a standard benchmark for classification algo- 
rithms. Two classes of points in a two-dimensional surface are arranged as inter- 
locked spirals. Separate the two spirals using the MLP network. 
6 Hopfield networks, simulated
annealing and chaotic neural 
networks 


6.1 Hopfield model 


The Hopfield model [27, 28] is the most popular dynamic model. It is biologically 
plausible since it functions like the human retina [36]. It is a fully interconnected 
recurrent network with J McCulloch-Pitts neurons. The Hopfield model is usu- 
ally represented by using a J-J layered architecture, as illustrated in Fig. 6.1. 
The input layer only collects and distributes feedback signals from the output 
layer. The network has a symmetric architecture with a symmetric zero-diagonal 
real weight matrix, that is, $w_{ij} = w_{ji}$ and $w_{ii} = 0$. Each neuron in the second
layer sums the weighted inputs from all the other neurons to calculate its current
net activation $\mathrm{net}_i$, then applies an activation function to $\mathrm{net}_i$ and broadcasts
the result along the connections to all the other neurons. In the figure, $w_{ii} = 0$
is represented by a dashed line; $\boldsymbol{\phi}(\cdot)$ and $\boldsymbol{\theta}$ are, respectively, a vector comprising
the activation functions for all the neurons and a vector comprising the biases
for all the neurons.

Figure 6.1 Architecture of the Hopfield network.

The Hopfield model operates in an unsupervised manner. The dynamics of the
network are described by a system of nonlinear ordinary differential equations.
The discrete form of the dynamics is defined by

$$\mathrm{net}_i(t+1) = \sum_{j=1}^{J} w_{ji} x_j(t) + \theta_i, \tag{6.1}$$

$$x_i(t+1) = \phi(\mathrm{net}_i(t+1)), \tag{6.2}$$

where $\mathrm{net}_i$ is the weighted net input of the $i$th neuron, $x_i(t)$ is the output of the
$i$th neuron, $\theta_i$ is a bias to the neuron, and $\phi(\cdot)$ is the sigmoidal function. The
discrete time variable $t$ in (6.1) and (6.2) takes values $0, 1, 2, \ldots$
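Correspondingly, the continuous-time Hopfield model is given by

$$\frac{d\,\mathrm{net}_i(t)}{dt} = \sum_{j=1}^{J} w_{ji} x_j(t) + \theta_i, \tag{6.3}$$

$$x_i(t) = \phi(\mathrm{net}_i(t)), \tag{6.4}$$

where $t$ denotes the continuous-time variable.
In order to characterize the performance of the network, the concept of energy
is introduced and the following energy function is defined [27]:

$$E = -\frac{1}{2} \sum_{i=1}^{J} \sum_{j=1}^{J} w_{ij} x_i x_j - \sum_{i=1}^{J} \theta_i x_i = -\frac{1}{2} \mathbf{x}^T \mathbf{W} \mathbf{x} - \mathbf{x}^T \boldsymbol{\theta}, \tag{6.5}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_J)^T$ is the input and state vector, and
$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_J)^T$ is the bias vector.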

Theorem 6.1 (Stability). The continuous-time Hopfield network always con- 
verges to a local minimum. 


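The proof is sketched here. From (6.5),

$$\frac{dE}{dt} = \sum_{i=1}^{J} \frac{dE}{dx_i} \frac{dx_i}{d\,\mathrm{net}_i} \frac{d\,\mathrm{net}_i}{dt},$$

where $\frac{dE}{dx_i} = -\left(\sum_{j=1}^{J} w_{ji} x_j + \theta_i\right) = -\frac{d\,\mathrm{net}_i}{dt}$. Thus,

$$\frac{dE}{dt} = -\sum_{i=1}^{J} \left(\frac{d\,\mathrm{net}_i}{dt}\right)^2 \frac{dx_i}{d\,\mathrm{net}_i}.$$

Since the sigmoidal function $\phi(\mathrm{net}_i)$ is monotonically increasing, $\frac{dx_i}{d\,\mathrm{net}_i}$ is always
positive. Thus, we have $\frac{dE}{dt} \le 0$. According to Lyapunov's theorem, the Hopfield
network always converges to a local minimum, and thus is stable. The dynamic
equations of the Hopfield network actually implement a gradient-descent algorithm
based on the cost function $E$ [28].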

Figure 6.2 Illustration of the state space of Hopfield network.

The updating of all neurons of the Hopfield model can be carried out synchronously
(Little dynamics) at each time step or asynchronously (Glauber
dynamics), updating them one at a time. The Hopfield model is asymptotically
stable when running in asynchronous or serial update mode. In asynchronous
mode, only one neuron at a time updates itself in the output layer, and the
energy function either decreases or stays the same after each iteration. However,
if the Hopfield memory is running in the synchronous or parallel update mode,
that is, all neurons update themselves at the same time, it may not converge to
a fixed point, but may instead become oscillatory between two states [41, 7, 12].
The Hopfield network with the signum activation has a smaller degree of freedom
compared to that using the sigmoidal activation, since it is constrained to
changing the states along the edges of a $J$-dimensional hypercube $\{-1, 1\}^J$.
The use of sigmoidal functions helps in smoothing out some of the local minima. 
Due to recurrence, the dynamics of the network are described by a system 
of ordinary differential equations and by an associated energy function to be 
minimized. The Hopfield model is a dynamic model that is suitable for hardware 
implementation and can converge to the result in the same order of time as the 
circuit time constant. The Hopfield network can be used for converting analog 
signals into digital format, for associative memory and for solving COPs. 


Example 6.1: We store the two points (-1, -1) and (1, 1) as the fixed points of
the network. Starting from random states, the network finally converges to one of
the two stored states, and this process is shown in Fig. 6.2.
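
The behavior in Example 6.1 can be reproduced with a small sketch. The example does not state how the weights were obtained, so the outer-product (Hebbian) storage rule used below is an assumption, as are the function names, the signum units, and the zero biases. The sketch also contrasts asynchronous and synchronous updating.

```python
def hebbian_weights(patterns):
    """Sum of outer products with zero diagonal (assumed storage rule)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W


def energy(W, x, theta):
    """Energy E = -1/2 x^T W x - x^T theta, as in (6.5)."""
    n = len(x)
    quad = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -0.5 * quad - sum(theta[i] * x[i] for i in range(n))


def net_input(W, x, theta, i):
    """net_i = sum_j w_ji x_j + theta_i."""
    return sum(W[j][i] * x[j] for j in range(len(x))) + theta[i]


def signum(v):
    return 1 if v >= 0 else -1


def async_sweep(W, x, theta):
    """Update the neurons one at a time, each using the latest state."""
    x = list(x)
    for i in range(len(x)):
        x[i] = signum(net_input(W, x, theta, i))
    return x


def sync_step(W, x, theta):
    """Update all neurons at once from the same previous state."""
    return [signum(net_input(W, x, theta, i)) for i in range(len(x))]


if __name__ == "__main__":
    W = hebbian_weights([(-1, -1), (1, 1)])
    theta = [0.0, 0.0]
    start = [1, -1]
    print("asynchronous:", async_sweep(W, start, theta))
    print("synchronous :", sync_step(W, start, theta))
    print("energy before/after:", energy(W, start, theta),
          energy(W, async_sweep(W, start, theta), theta))
```

With these assumed weights, an asynchronous sweep from the state (1, -1) lowers the energy and lands on a stored point, whereas synchronous updating from the same state alternates between (1, -1) and (-1, 1).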


6.2 Continuous-time Hopfield network

The high interconnection in physical topology makes the Hopfield network especially
suitable for analog VLSI implementation. The convergence time of the
network dynamics is decided by a circuit time constant, which is of the order of
a few nanoseconds. The Hopfield network can be implemented by interconnecting
an array of resistors, nonlinear operational amplifiers with symmetrical outputs,
capacitors, and external bias current sources. Each neuron can be implemented
by a capacitor, a resistor, and a nonlinear amplifier. A current source is necessary
for representing the bias. The circuit structure of the neuron is shown in Fig. 6.3.
Here $v_i$, $i = 1, \ldots, J$, is the output voltage of neuron $i$, $I_i$ is the external bias current
source for neuron $i$, $u_i$ is the voltage at the interconnection point, $C_i$ is a capacitor,
and $R_{ik}$, $k = 0, 1, \ldots, J$, are resistors. The sigmoidal function $\phi(\cdot)$ is used
as the transfer function of the amplifiers. A drawback of the Hopfield network
is the necessity to update the complete set of network coefficients caused by the
signal change, and this causes difficulties in its circuit implementation.

Figure 6.3 A circuit for neuron i in the Hopfield model.

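By applying Ohm's law and Kirchhoff's current law to the $i$th neuron, we
obtain

$$C_i \frac{du_i}{dt} = -G_i u_i + \sum_{j=1}^{J} G_{ij} v_j + I_i, \tag{6.6}$$

where $v_i = \phi(u_i)$, $\phi(u_i)$ is the sigmoidal function, and

$$G_i = \frac{1}{R_i} = \frac{1}{R_{i0}} + \sum_{j=1}^{J} \frac{1}{R_{ij}} = G_{i0} + \sum_{j=1}^{J} G_{ij}, \tag{6.7}$$

$G_{ij}$ being the conductance. In the circuits shown in Fig. 6.3, the inverting output
of each neuron is used to generate negative weights since the conductance $G_{ij}$ is
always positive.
Equation (6.6) can be written as

$$\tau_i \frac{du_i}{dt} = -\alpha_i u_i + \sum_{j=1}^{J} w_{ji} x_j + \theta_i = -\alpha_i u_i + \mathbf{w}_i^T \mathbf{x} + \theta_i, \tag{6.8}$$

where $x_i = v_i = \phi(u_i)$ is the input signal, $\tau_i = r_i C_i$ is the circuit time constant,
$r_i$ is a scaling resistance, $\alpha_i = r_i G_i$ is a damping coefficient of the integrator,
which forces the internal signal $u_i$ to zero for a zero input, $w_{ji} = r_i / R_{ij} = r_i G_{ij}$
is the synaptic weight, $\theta_i = r_i I_i$ is the external bias signal, and
$\mathbf{w}_i = (w_{1i}, w_{2i}, \ldots, w_{Ji})^T$ is the weight vector feeding neuron $i$. Equation (6.8)
is known as the continuous-time Hopfield network, and its circuit is shown in
Figure 6.4.

Figure 6.4 Continuous-time Hopfield structure.

The dynamics of the whole network can be written as

$$\boldsymbol{\tau} \frac{d\mathbf{u}}{dt} = -\boldsymbol{\alpha} \mathbf{u} + \mathbf{W}^T \mathbf{x} + \boldsymbol{\theta}, \tag{6.9}$$

where the circuit time constant matrix $\boldsymbol{\tau} = \mathrm{diag}(\tau_1, \tau_2, \ldots, \tau_J)$, the interconnect-point
voltage vector $\mathbf{u} = (u_1, u_2, \ldots, u_J)^T$, the damping coefficient matrix
$\boldsymbol{\alpha} = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_J)$, the input and state vector $\mathbf{x} = (x_1, x_2, \ldots, x_J)^T$, the bias
vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_J)^T$, and the $J \times J$ weight matrix $\mathbf{W} = [\mathbf{w}_1 \, \mathbf{w}_2 \ldots \mathbf{w}_J]$.

At the equilibrium of the system, $\frac{d\mathbf{u}}{dt} = \mathbf{0}$, thus

$$\boldsymbol{\alpha} \mathbf{u} = \mathbf{W}^T \mathbf{x} + \boldsymbol{\theta}. \tag{6.10}$$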
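
A rough numerical illustration of (6.9) and (6.10) is given below. It integrates the network dynamics with Euler's method and then checks the equilibrium condition. The choice $\phi(u) = \tanh(\beta u)$, the two-neuron weights, the step size, and the function names are all assumptions made only for this sketch.

```python
import math


def outputs(u, beta):
    """x_i = phi(u_i), with phi(u) = tanh(beta * u) as an assumed sigmoid."""
    return [math.tanh(beta * ui) for ui in u]


def simulate(W, theta, alpha, tau, u0, beta=2.0, dt=0.01, n_steps=5000):
    """Euler integration of tau_i du_i/dt = -alpha_i u_i + sum_j w_ji x_j + theta_i."""
    n = len(u0)
    u = list(u0)
    for _ in range(n_steps):
        x = outputs(u, beta)
        du = [(-alpha[i] * u[i]
               + sum(W[j][i] * x[j] for j in range(n))
               + theta[i]) / tau[i]
              for i in range(n)]
        u = [u[i] + dt * du[i] for i in range(n)]
    return u


def equilibrium_residual(W, theta, alpha, u, beta=2.0):
    """Largest absolute violation of alpha_i u_i = sum_j w_ji x_j + theta_i."""
    n = len(u)
    x = outputs(u, beta)
    return max(abs(alpha[i] * u[i]
                   - sum(W[j][i] * x[j] for j in range(n))
                   - theta[i])
               for i in range(n))


if __name__ == "__main__":
    W = [[0.0, 1.0], [1.0, 0.0]]  # symmetric, zero diagonal (hypothetical values)
    theta = [0.0, 0.0]
    alpha = [1.0, 1.0]
    tau = [1.0, 1.0]
    u = simulate(W, theta, alpha, tau, u0=[0.1, 0.2])
    print("u at the end of the run:", u)
    print("residual of (6.10):", equilibrium_residual(W, theta, alpha, u))
```

For this symmetric zero-diagonal weight matrix the state settles to a fixed point, and the residual of (6.10) becomes negligible.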


The dynamics of the network are controlled by $C_i$ and $R_{ij}$. A sufficient condition
for the Hopfield network to be stable is that $\mathbf{W}$ is a symmetric matrix with diagonal
elements being zero [28]. The stable states correspond to the local minima
of the Lyapunov function [28]

$$E(\mathbf{x}) = -\frac{1}{2} \mathbf{x}^T \mathbf{W} \mathbf{x} - \mathbf{x}^T \boldsymbol{\theta} + \sum_{i=1}^{J} \alpha_i \int_0^{x_i} \phi^{-1}(\xi)\, d\xi, \tag{6.11}$$

where $\phi^{-1}(\cdot)$ is the inverse of $\phi(\cdot)$. Equation (6.11) is a special case of the
Cohen-Grossberg model [16].

From (6.8) and (6.3), it is seen that $\alpha$ is zero in the basic Hopfield model. This
term corresponds to an integral related to $\phi^{-1}(\cdot)$ in the energy function. When
the gain of the sigmoidal function $\beta \to \infty$, that is, when the sigmoidal function
is selected as the hard-limiter function and the nonlinear amplifiers function as
switches, the integral terms are insignificant and $E(\mathbf{x})$ in (6.11) approaches (6.5).
In this case, the circuit model is exact for the basic Hopfield model. The stable
states of the basic Hopfield network are the corners of the hypercube, namely, the
local minima of (6.5) are in $\{-1, +1\}^J$ [28]. For large but finite gains, a sigmoidal
function leads to a large positive contribution near the hypercube boundaries,
but to a negligible contribution far from the boundaries. This leads to an energy
surface that still has its maxima at the corners, but the minima slightly move
inward from the corners of the hypercube. As $\beta$ decreases, each minimum moves
further inward and disappears one at a time. When $\beta$ gets small enough, the
energy minima start to disappear.

Discrete-time symmetric Hopfield networks are essentially as powerful
computationally as general asymmetric networks, despite their Lyapunov-function
constrained dynamics. In the binary-state case, symmetric networks can simulate
asymmetric ones with only a linear increase in the network size, and in the
analog-state case, finite symmetric networks are also Turing universal, provided
they are supplied with a simple external clock to prevent them from converging.
The energy minimization problem remains NP-hard for analog networks
also [55]. Continuous-time symmetric Hopfield networks are capable of general
computation. Such networks have very constrained Lyapunov-function controlled
dynamics. They are universal and efficient computational devices: any convergent
synchronous fully parallel computation by a recurrent network of n discrete-time
binary neurons, with general asymmetric coupling weights, can be simulated
by a symmetric continuous-time Hopfield network containing only 18n + 7 units 
employing the saturated-linear activation function [56]. In terms of standard dis- 
crete computation models, any polynomially space-bounded Turing machine can 
be simulated by a family of polynomial-size continuous-time symmetric Hopfield 
nets [56]. 


Linear system in a saturated mode 

For the realization given in Fig. 6.3, it is not possible to independently adjust
the network parameters, since the coefficient $\alpha_i$ is nonlinearly related to all the
weights $w_{ij}$.

In another circuit implementation of the Hopfield model [40], the coefficients $\alpha_i$
are removed by replacing the integrators and nonlinear amplifiers in the previous
model by ideal integrators with saturation. The circuit of such a neuron is
illustrated in Fig. 6.5. Notice that the integrator and the nonlinear amplifier in
Fig. 6.3 are replaced by an ideal integrator with saturation; $V_{cc}$ is the power
voltage. The dynamic equation of this neuron can be described by

$$\frac{du_i}{dt} = \frac{1}{C_i} \left( \sum_{j=1}^{J} \frac{1}{R_{ij}} v_j + I_i \right), \tag{6.12}$$

where $|v_i| \le 1$. Comparing (6.12) with (6.3), we have $w_{ji} = \frac{1}{C_i R_{ij}}$ and $\theta_i = \frac{I_i}{C_i}$.
This model is referred to as a linear system in a saturated mode, which retains
the basic structure of the Hopfield model and is easier to analyze, synthesize and
implement than the Hopfield model. The energy function of the model is exactly
the same as (6.5).

Figure 6.5 A modified circuit for neuron i in the Hopfield model.


6.3 Simulated annealing


Annealing is referred to as tempering certain alloys of metal by heating and then 
gradually cooling them. The simulation of this process is known as simulated 
annealing. A metal is first heated above its melting point and then cooled slowly 
until it solidifies into a perfect crystalline structure. The defect-free crystal state 
corresponds to the global minimum energy configuration. The Metropolis algo- 
rithm is a simple method for simulating the evolution to the thermal equilibrium 
of a solid for a given temperature [46]. Simulated annealing [34] is a variant of 
the Metropolis algorithm, where the temperature is changing from high to low. 
It is a descent algorithm modified by random ascent moves in order to escape 
local minima which are not global minima. The annealing algorithm simulates a 
nonstationary finite state Markov chain whose state space is the domain of the 
cost function to be minimized. The idea of annealing is a general optimization 
principle. 

Simulated annealing is a general, serial algorithm for finding a global mini- 
mum. The solutions by this technique are close to the global minimum within 
a polynomial upper bound for the computational time and are independent of 
the initial conditions. Simulated annealing is a popular Monte Carlo algorithm 
for combinatorial optimization. Some parallel algorithms for simulated annealing 
have been proposed aiming to improve the accuracy of the solutions by applying 
parallelism [18]. 

Algorithm 6.1 (Simulated annealing).

1. Initialize the system configuration. Randomize $\mathbf{x}(0)$.
2. Initialize $T$ with a large value.
3. Repeat until $T$ is small enough:
   - Repeat until the number of accepted transitions is below a threshold:
     a. Apply random perturbations $\mathbf{x} = \mathbf{x} + \Delta\mathbf{x}$.
     b. Evaluate $\Delta E(\mathbf{x}) = E(\mathbf{x} + \Delta\mathbf{x}) - E(\mathbf{x})$:
        * if $\Delta E(\mathbf{x}) < 0$, keep the new state;
        * otherwise, accept the new state with probability $P = e^{-\frac{\Delta E}{T}}$.
   - Set $T = T - \Delta T$.

According to statistical thermodynamics, $P_\alpha$, the probability of a physical
system being in state $\alpha$ with energy $E_\alpha$ at temperature $T$ satisfies the Boltzmann
distribution (also known as the Boltzmann-Gibbs distribution):

$$P_\alpha = \frac{1}{Z} e^{-\frac{E_\alpha}{k_B T}}, \tag{6.13}$$

where $k_B$ is Boltzmann's constant, $T$ is the absolute temperature, and $Z$ is the
partition function, defined by

$$Z = \sum_{\beta} e^{-\frac{E_\beta}{k_B T}}, \tag{6.14}$$

the summation being taken over all states $\beta$ with energy $E_\beta$ at temperature $T$.
At high $T$, the Boltzmann distribution exhibits uniform preference for all the
states, regardless of the energy. When $T$ approaches zero, only the states with
minimum energy have nonzero probability of occurrence.
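
The temperature dependence of the Boltzmann distribution (6.13) and (6.14) can be illustrated numerically. In the sketch below the constant $k_B$ is set to 1 and the four energy levels are hypothetical; the function name is also an assumption.

```python
import math


def boltzmann(energies, T):
    """Return P_alpha = exp(-E_alpha / T) / Z for each state, with k_B = 1."""
    e_min = min(energies)
    # Shifting all energies by e_min leaves the probabilities unchanged
    # and avoids overflow in exp() at low temperatures.
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]


if __name__ == "__main__":
    levels = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels
    for T in (1e6, 1.0, 0.01):
        print(T, boltzmann(levels, T))
```

At the highest temperature the four probabilities are nearly equal, while at the lowest temperature almost all probability mass sits on the state with minimum energy.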

In simulated annealing, we omit the constant $k_B$. $T$ is a control parameter
called the computational temperature, which controls the magnitude of the
perturbations of the energy function $E(\mathbf{x})$. At high $T$, the system ignores small
changes in the energy and approaches thermal equilibrium rapidly, that is, it
performs a coarse search of the space of global states and finds a good minimum.
As $T$ is lowered, the system responds to small changes in the energy, and
performs a fine search in the neighborhood of the already determined minimum
and finds a better minimum. At $T = 0$, any change in the system states does not
lead to an increase in the energy, and thus, the system must reach equilibrium
if $T = 0$.

When performing simulated annealing, theoretically a global minimum is guaranteed
to be reached with a high probability. The artificial thermal noise is gradually
decreased in time. The probability of a state change is determined by the
Boltzmann distribution of the energy difference of the two states

$$P = e^{-\frac{\Delta E}{T}}. \tag{6.15}$$

The probability of uphill moves in the energy function ($\Delta E > 0$) is large at high
$T$, and is low at low $T$. Simulated annealing allows uphill moves in a controlled
fashion: It attempts to improve on greedy local search by occasionally taking a
risk and accepting a worse solution. It can be performed by Algorithm 6.1 [34].
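
A minimal Python sketch in the spirit of Algorithm 6.1 is given below for a one-dimensional problem. Several details are assumptions rather than part of the algorithm as stated: the inner loop uses a fixed number of trial moves per temperature instead of a threshold on accepted transitions, perturbations are drawn uniformly from a fixed interval, the best state seen so far is recorded, and the double-well energy function is hypothetical.

```python
import math
import random


def double_well(x: float) -> float:
    """Hypothetical energy with one global and one local minimum."""
    return x ** 4 - 4.0 * x ** 2 + x


def simulated_annealing(energy, x0, T0=10.0, dT=0.01, T_min=0.01,
                        moves_per_T=50, step=0.5, seed=0):
    """Linear cooling T = T - dT with the acceptance rule P = exp(-dE / T)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    T = T0
    while T > T_min:
        for _ in range(moves_per_T):
            x_new = x + rng.uniform(-step, step)   # random perturbation
            e_new = energy(x_new)
            dE = e_new - e
            # Downhill moves are always kept; uphill moves are accepted
            # with probability exp(-dE / T).
            if dE < 0 or rng.random() < math.exp(-dE / T):
                x, e = x_new, e_new
                if e < best_e:
                    best_x, best_e = x, e
        T -= dT
    return best_x, best_e


if __name__ == "__main__":
    print(simulated_annealing(double_well, x0=1.5))
```

Recording the best state is a common practical addition; it guarantees that the returned energy is never worse than that of the starting point, whatever the random moves turn out to be.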

Classical simulated annealing is known as Boltzmann annealing. The cooling 
schedule for T is critical to the efficiency of the algorithm. If T is reduced too 
rapidly, a premature convergence to a local minimum may occur. In contrast, 
if it is reduced too slowly, the algorithm is very slow to converge. Based on a 
Markov-chain analysis on the simulated annealing process, a simple necessary 
and sufficient condition on the cooling schedule for the algorithm state to converge
in probability to the set of globally minimum cost states is that $T$ must
be decreased according to [24]

$$T(t) \ge \frac{T_0}{\ln(1+t)}, \quad t = 1, 2, \ldots \tag{6.16}$$

to ensure convergence to the global minimum with probability one, where $T_0$ is
a sufficiently large initial temperature. In [25], $T_0$ is proved to be greater than
or equal to the depth, suitably defined, of the deepest local minimum which is
not a global minimum state. In other words, in order to guarantee the Boltzmann
annealing to converge to the global minimum with probability one, $T(t)$ is needed
to decrease logarithmically with time. This is practically too slow. In practice,
one usually applies a fast schedule $T(t) = \alpha T(t-1)$ with $0.85 \le \alpha \le 0.96$, to
achieve a suboptimal solution.
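
To see why the logarithmic schedule (6.16) is considered too slow in practice, the sketch below counts how many steps each schedule needs to cool from $T_0$ to a target temperature. The values $T_0 = 1$, target $0.1$, and $\alpha = 0.9$, as well as the function names, are assumptions for illustration.

```python
import math


def logarithmic_schedule(T0: float, t: int) -> float:
    """Slowest admissible temperature T0 / ln(1 + t) from (6.16), t = 1, 2, ..."""
    return T0 / math.log(1.0 + t)


def geometric_schedule(T0: float, alpha: float, t: int) -> float:
    """T(t) = alpha * T(t - 1) with T(0) = T0, that is, T0 * alpha**t."""
    return T0 * alpha ** t


def steps_to_reach(schedule, target: float, max_steps: int = 10 ** 7):
    """Smallest t >= 1 with schedule(t) <= target, or None if not reached."""
    t = 1
    while schedule(t) > target:
        t += 1
        if t > max_steps:
            return None
    return t


if __name__ == "__main__":
    T0, target = 1.0, 0.1
    print("logarithmic:",
          steps_to_reach(lambda t: logarithmic_schedule(T0, t), target))
    print("geometric  :",
          steps_to_reach(lambda t: geometric_schedule(T0, 0.9, t), target))
```

With these values the logarithmic bound needs about $e^{10} \approx 22026$ steps to reach the target, whereas the geometric schedule with $\alpha = 0.9$ needs 22.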

Classical simulated annealing is a slow stochastic search method. The search 
has been accelerated in Cauchy annealing [58], simulated reannealing [30], gen- 
eralized simulated annealing [61], and simulated annealing with known global 
value [43]. Some VLSI designs of simulated annealing are also available [36]. 

In Cauchy annealing [58], the Cauchy distribution, also known as the Cauchy- 
Lorentz distribution, is used to replace the Boltzmann distribution. The infinite 
variance provides a better ability to escape from local minima and allows for the 
use of faster schedules, such as $T$ decreasing according to $T(t) = \frac{T_0}{t}$. A stochastic neural
network trained with Cauchy annealing is also called a Cauchy machine. Gen- 
eralized simulated annealing [61] generalizes Cauchy annealing and Boltzmann 
annealing within a unified framework inspired by the generalized thermostatis- 
tics. In simulated reannealing [30], T decreases exponentially with t. 

In the fuzzy annealing scheme [50], fuzzification is performed by adding an 
entropy term. The fuzziness at the beginning of the entire procedure is used to 
prevent the optimization process from getting stuck at an inferior local opti- 
mum. Fuzziness is reduced step by step. Fuzzy annealing results in an increase 
in computation speed by a factor of one hundred or more compared to simulated 
annealing [50]. 

Deterministic annealing [51, 52] is a method where randomness is incorporated
into the energy or cost function, which is then deterministically optimized
at a sequence of decreasing temperatures. The approach is derived within the
framework of information theory and probability theory. The annealing process
is equivalent to the computation of Shannon's rate-distortion function, and the
annealing temperature is inversely proportional to the slope of the curve. The
application-specific cost is minimized subject to a constraint on the randomness
(Shannon entropy) of the solution, which is gradually lowered. The iterative
procedure is monotonically non-increasing in the cost function. Unlike simulated
annealing, it is a deterministic method that replaces stochastic simulations by the
use of expectation. It has been used for nonconvex optimization problems such
as clustering, MLP training, and RBF network training [51, 52]. The
reduced-complexity deterministic annealing algorithms [20] use simple low-complexity
distributions to mimic the Gibbs distribution used in standard deterministic
annealing, yielding an acceleration of over 100 times with negligible performance
difference for vector quantizer design.

Parallel simulated annealing takes advantage of parallel processing. In [6],
each of a fixed set of samplers operates at a different temperature. A solution that
costs less is propagated from the higher temperature sampler to the neighboring
sampler operating at a lower temperature. Therefore, the best solution at a given
time is propagated to all samplers operating at a lower temperature. Sample-Sort
[60] has a fixed set of samplers, each operating at a different static temperature.
The set of samplers uses a biased generator to sample the same distribution as a
serial simulated annealing algorithm to maintain the same convergence property.
It propagates a lower-cost solution to other samplers, but does it probabilistically
by permitting the samplers to exchange solutions with neighboring samplers.
The samplers are lined up in a row and exchange solutions with samplers that
are one or more hops away. It adjusts the probability of accepting a higher cost
solution depending on the temperature of the neighboring sampler.

Multiobjective simulated annealing uses the domination concept and the 
annealing scheme for efficient search [48]. In [57], the proposed multiobjective 
simulated annealing maps the optimization of multiple objectives to a single- 
objective optimization using the true tradeoff surface, maintaining the conver- 
gence properties of simulated annealing, while encouraging exploration of the 
full tradeoff surface. 


Hopfield networks for optimization 


The Hopfield network is usually used for solving optimization problems. In gen- 
eral, the continuous model is superior to the discrete one in terms of the local 
minimum problem, because of its smoother energy surface. Hence, the continu- 
ous Hopfield network has dominated the techniques for optimization problems, 
especially for combinatorial problems [29]. 

From the computational aspect, the operation of the Hopfield network for an
optimization problem amounts to a dynamical system characterized by an energy
function, which is a combination of the objective function and constraints of the
original problem. Three common techniques, namely penalty functions, Lagrange 
multipliers, and primal and dual methods, are utilized to construct an energy 
function. These techniques are suitable for solving various optimization problems 
such as LP, nonlinear programming and mixed-integer LP. 

For the penalty method, there is always a compromise between solution quality
and convergence. To obtain a feasible solution, the weighting factors for the
penalty terms should be sufficiently large; this, however, makes the objective
function of the original problem relatively weaker, resulting in a deterioration
of the quality of the solution. A trial-and-error process for choosing some of
the penalty parameters is inevitable in order to obtain feasible solutions. Moreover,
the gradient-descent method often leads to a local minimum of the energy
landscape.

The local minima of (6.5) correspond to the attractors in the phase space, 
which are nominal memories of the network. A large class of COPs can be 
expressed in this form of QP optimization problems, and thus can be solved 
using the Hopfield network. The Hopfield network can be used as an effective 
interface between analog and digital devices, where the input signals to the net- 
work are analog and the output signals are discrete values. The neural interface 
has the capability of learning. The neural-based analog-to-digital (A/D) con- 
verter adapts to compensate for initial device mismatches or long-term drifts 
[59]. 

Many neural network models for linear and quadratic programming, least squares,
matrix algebra, constrained optimization, and discrete and combinatorial
optimization problems are described in [15].


Combinatorial optimization problems 


Any problem that has a large set of discrete solutions and a cost function for
rating those solutions relative to one another is a COP. Many COPs are known to
be NP-complete, namely, nondeterministic polynomial-time complete. In COPs,
the number of solutions grows exponentially with n, the size of the problem, at
$O(n!)$ or $O(e^n)$, and no known algorithm can find the global minimum solution
in polynomial computational time. The goal for COPs is to find an optimal
solution or sometimes a nearly optimal solution. Exhaustive search of all the
possible solutions for the optimum is impractical.

The Hopfield network can be effectively used to deal with COPs with the objec- 
tive functions of the linear or quadratic form, linear equalities and/or inequali- 
ties as the constraints, and binary variable values so that the constructed energy 
function can be of quadratic form. It can be used to solve the two well-known 
COPs: traveling salesman problem (TSP) and the location-allocation problem. 


Traveling salesman problem (TSP) 

TSP is the most notorious NP-complete problem. The definition is simple: Find 
the shortest closed-path through all points. Given a set of points, either nodes on 
a graph or cities on a map, find the shortest possible tour that visits every point 
exactly once and then returns to its starting point. There are $(n-1)!/2$ possible
tours for an n-city symmetric TSP. For symmetric TSPs, the distances between nodes are
independent of the direction, i.e., $d_{ij} = d_{ji}$ for every pair of nodes. In the
asymmetric TSP, at least one pair of nodes satisfies $d_{ij} \neq d_{ji}$. The Hopfield network
was the first neural network used for TSP, and it achieves a near-optimum solu- 
tion [29]. TSP arises in numerous optimization problems, from routing of wires 
on a printed circuit board, to VLSI circuit design, to fast food delivery, to parts 
placement in electronic circuits, to routing in communication network systems, 
and to resource planning.

For TSP, each city is represented by the output states of a set of n
neurons, since it can be in any one of the n positions in the tour list. For n cities, a
total of n independent sets of n neurons are needed to describe a complete tour.
Hence, there is a total of $N = n^2$ neurons, which are displayed as an $n \times n$ square
array for the network. Since the neural outputs of the network are arranged
as n rows of n neurons, the N outputs are represented
by double indices $x_{Xj}$, denoting the Xth city in the jth position in a tour. To
permit the N neurons in the network to compute a solution to the problem,
the energy function must be constructed so that its lowest value corresponds to the
best path. The space over which the energy function is minimized in the high-gain
limit is the $2^N$ corners of the
N-dimensional hypercube. A benchmark set for the TSP community is TSPLIB,
a growing collection of sample instances.
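The city-by-position representation can be made concrete with a few lines of code. The sketch below is illustrative only; the helper names `num_symmetric_tours` and `tour_to_matrix` are hypothetical, and cities and positions are numbered from 0 rather than from 1.

```python
import math
import numpy as np

def num_symmetric_tours(n):
    """Number of distinct closed tours of a symmetric n-city TSP, (n-1)!/2 (for n >= 3)."""
    return math.factorial(n - 1) // 2

def tour_to_matrix(tour):
    """Encode a tour as an n x n 0/1 matrix.

    Row X corresponds to a city, column j to a position in the tour;
    x[X, j] = 1 exactly when city X is visited at position j.
    """
    n = len(tour)
    x = np.zeros((n, n), dtype=int)
    for position, city in enumerate(tour):
        x[city, position] = 1
    return x
```

A valid tour yields a permutation matrix, that is, exactly one 1 in every row and every column, whereas a general corner of the hypercube need not correspond to a tour at all.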

The problem can be described as: 


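$$\min \sum_{X} \sum_{Y \neq X} \sum_{i} d_{XY} x_{Xi} \left( x_{Y,i+1} + x_{Y,i-1} \right) \qquad (6.17)$$


subject to


$$\sum_{X} \sum_{i} \sum_{j \neq i} x_{Xi} x_{Xj} = 0, \qquad (6.18)$$


$$\sum_{i} \sum_{X} \sum_{Y \neq X} x_{Xi} x_{Yi} = 0, \qquad (6.19)$$


$$\left( \sum_{X} \sum_{i} x_{Xi} - n \right)^2 = 0, \qquad (6.20)$$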


where all indices X, Y, i,j run from 1 to n. The objective is to find the shortest 
tour. The first constraint is satisfied if and only if each city row X contains no 
more than one 1, i.e., the rest of the entries should be zero. The second constraint 
is satisfied if and only if each “position in tour” column contains no more than 
one 1, i.e., the rest of the entries are zero. The third constraint is satisfied if and 
only if there are n entries of one in the entire matrix. 
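The objective and the three constraint expressions can be checked numerically for a candidate 0/1 matrix. The following is a minimal sketch, assuming that tour positions wrap around (position n is followed by position 1) so that the tour is closed; the function names and the weight tuple `lam` are hypothetical and only illustrate how a penalty-type energy combines the four sums.

```python
import numpy as np

def tsp_energy_terms(x, d):
    """Return (row, col, total, length) for a city-by-position matrix x.

    x[X, i] = 1 if city X occupies tour position i, else 0; d[X, Y] is the distance.
    Positions i + 1 and i - 1 are taken modulo n (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    d = np.array(d, dtype=float)
    np.fill_diagonal(d, 0.0)                                    # enforce Y != X
    n = x.shape[0]
    row = np.sum(x.sum(axis=1) ** 2 - (x ** 2).sum(axis=1))     # two 1s in one city row
    col = np.sum(x.sum(axis=0) ** 2 - (x ** 2).sum(axis=0))     # two 1s in one position column
    total = (x.sum() - n) ** 2                                  # exactly n entries equal to 1
    neighbors = np.roll(x, -1, axis=1) + np.roll(x, 1, axis=1)  # x[Y, i+1] + x[Y, i-1]
    length = np.sum(x * (d @ neighbors))                        # each tour edge is counted twice
    return row, col, total, length

def tsp_energy(x, d, lam=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights are placeholders to be tuned."""
    return float(np.dot(lam, tsp_energy_terms(x, d)))
```

For a valid tour the first three values are zero and the last equals twice the tour length, since every edge is visited once from each of its end cities.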

Consider those corners of this space which are the local minima of the energy
function [29]


$$E = \lambda_1 \sum_{X} \sum_{i} \sum_{j \neq i} x_{Xi} x_{Xj} + \lambda_2 \sum_{i} \sum_{X} \sum_{Y \neq X} x_{Xi} x_{Yi} + \lambda_3 \left( \sum_{X} \sum_{i} x_{Xi} - n \right)^2 + \lambda_4 \sum_{X} \sum_{Y \neq X} \sum_{i} d_{XY} x_{Xi} \left( x_{Y,i+1} + x_{Y,i-1} \right), \qquad (6.21)$$


where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are positive parameters, chosen by trial and error for a
particular problem. The first three terms describe the feasibility requirements
and take the value zero for a valid tour [29]. The last term represents the
objective function of TSP.

The multiple TSP is a generalization of TSP: given a set of intermediate
cities and a depot, m salesmen must visit all intermediate cities according to 
the constraints that the route formed by each salesman must start and end at 
the depot; each intermediate city must be visited once and by a single salesman; 
and the cost of the routes must be minimum. Multiple TSP is NP-complete as 
it includes TSP as a special case. It can be applied to vehicle routing and job 
scheduling. 

A Lagrange multiplier and Hopfield-type barrier function method is proposed 
in [19] for approximating a solution of TSP. The method is more effective and 
efficient than the soft-assign algorithm. The introduced Hopfield-type barrier 
term is given by [28, 19] 


$$d(x_{ij}) = x_{ij} \ln x_{ij} + (1 - x_{ij}) \ln (1 - x_{ij}) \qquad (6.22)$$


to incorporate $0 \le x_{ij} \le 1$ into the objective function.
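A short numerical sketch shows why such a term acts as a barrier: the term itself stays finite on the unit interval, but its slope grows without bound toward either end, which pushes a gradient-based minimizer back into the interior. The code assumes the entropy-like form x ln x + (1 - x) ln(1 - x) for the barrier; the function names are hypothetical.

```python
import math

def barrier(x):
    """Barrier term x ln x + (1 - x) ln(1 - x), with 0 ln 0 taken as 0."""
    def xlogx(v):
        return 0.0 if v == 0.0 else v * math.log(v)
    return xlogx(x) + xlogx(1.0 - x)

def barrier_grad(x):
    """Derivative ln(x / (1 - x)); defined only for 0 < x < 1."""
    return math.log(x) - math.log(1.0 - x)
```

The derivative is zero at x = 0.5 and tends to minus or plus infinity as x approaches 0 or 1, respectively.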


Location-allocation problem 
The location-allocation problem can be stated as follows. Given a set of facilities, 
each of which serves a certain number of nodes on a graph, the objective is to 
place the facilities on the graph so that the average distance between each node 
and its serving facility is minimized. 

A class of COPs including the location-allocation problem can be formulated 
as [44] 


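$$\min \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij} \qquad (6.23)$$

subject to

$$\sum_{j=1}^{n} a_{ij} x_{ij} = b_i, \quad i = 1, \ldots, m, \qquad (6.24)$$

$$\sum_{i=1}^{m} x_{ij} = 1, \quad j = 1, \ldots, n, \qquad (6.25)$$

$$x_{ij} \in \{0, 1\}, \quad i = 1, \ldots, m, \; j = 1, \ldots, n, \qquad (6.26)$$


where $X = [x_{ij}] \in \{0,1\}^{m \times n}$ is a variable matrix, and $c_{ij} \ge 1$, $a_{ij} \ge 1$, and $b_i \ge 1$
are constant integers.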

To make use of the Hopfield network, one needs first to convert a COP into a 
constrained optimization problem and solve the latter using the penalty method. 
The COP defined by (6.23) through (6.26) can be transformed into the minimization
of the following total cost


$$E = \lambda_1 \sum_{i=1}^{m} \left( \sum_{j=1}^{n} a_{ij} x_{ij} - b_i \right)^2 + \lambda_2 \sum_{j=1}^{n} \left( \sum_{i=1}^{m} x_{ij} - 1 \right)^2 + \lambda_3 \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij} (1 - x_{ij}) + \lambda_4 \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij}, \qquad (6.27)$$


where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are weights of the individual terms, which can be
tuned for an optimal or good solution. When the first three terms are all zeros,
the solution is a feasible one. The cost E has the same form as that of the energy 
function of the Hopfield network, and thus can be solved by using the Hopfield 
network. 

By minimizing the square of (6.23), the network distinguishes optimal solutions 
more sharply than with (6.27) and this greatly overcomes many of the weaknesses 
of the network with (6.27) [44]. 
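The individual terms of such a total cost can be evaluated directly for a candidate assignment matrix. The sketch below assumes a cost made of two squared equality-constraint violations, a binary-forcing term, and the linear cost; the function name `allocation_cost_terms` is hypothetical, and the weights are left to the caller.

```python
import numpy as np

def allocation_cost_terms(x, a, b, c):
    """Return (row_eq, col_eq, binary, cost) for an m x n assignment matrix x."""
    x, a, b, c = (np.asarray(v, dtype=float) for v in (x, a, b, c))
    row_eq = np.sum(((a * x).sum(axis=1) - b) ** 2)   # sum_i (sum_j a_ij x_ij - b_i)^2
    col_eq = np.sum((x.sum(axis=0) - 1.0) ** 2)       # sum_j (sum_i x_ij - 1)^2
    binary = np.sum(x * (1.0 - x))                    # zero only when every entry is 0 or 1
    cost = np.sum(c * x)                              # linear cost sum_ij c_ij x_ij
    return row_eq, col_eq, binary, cost
```

A feasible 0/1 assignment makes the first three values vanish, so that only the linear cost remains; any fractional entry is penalized by the third value.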


Combinatorial optimization problems with equality and 
inequality constraints 
The Hopfield network can be used to solve COPs under equality as well as 
inequality constraints, as long as the constructed energy function is of the form 
of (6.5). Constraints are treated by introducing in the objective function some 
additional energy terms that penalize any infeasible state. Some extensions to 
the Hopfield model are necessary in order to handle both equality and inequality 
constraints [59, 2]. 

Assume that we have both linear equality and inequality constraints 


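$$r_i^T x = s_i, \quad i = 1, \ldots, l, \qquad (6.28)$$
$$q_j^T x \le h_j, \quad j = 1, \ldots, k, \qquad (6.29)$$
where $r_i = (r_{i1}, \ldots, r_{iJ})^T$, $q_j = (q_{j1}, \ldots, q_{jJ})^T$, $s_i$ is a constant, and $h_j \ge 0$.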


In the extended Hopfield model [2], each inequality constraint is converted 
to an equality constraint by introducing an additional variable managed by a 
new neuron, known as the slack neuron. Each slack neuron is connected to the 
initial neurons, where their corresponding variables occur in a linear combination. 
The extended Hopfield model has the drawback of being frequently stabilized 
in neuron states far from the suitable ones, i.e. zero and one. To deal with 
this drawback, a new penalty energy term is derived to significantly reduce the 
number of neurons with unsuitable states [38]. 
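As a small worked illustration of the slack-neuron idea, an inequality constraint can be rewritten as an equality with a nonnegative slack variable, and the violation of that equality can then be penalized quadratically. This is a sketch only: the slack variable $y_j$, its range, and the weight $\lambda$ are notation assumed here and are not taken from [2].

```latex
q_j^T x \le h_j
\quad \Longleftrightarrow \quad
q_j^T x + y_j = h_j \ \text{ for some } y_j \ge 0,
\qquad
E_j = \frac{\lambda}{2} \left( q_j^T x + y_j - h_j \right)^2 .
```

The penalty $E_j$ can be driven to zero by an admissible $y_j$ exactly when the original inequality holds, so the slack neuron absorbs the unused margin of the constraint.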


Escaping local minima for combinatorial optimization problems 


Simulated annealing is a popular method for any optimization problem, including
COPs. However, due to its Monte Carlo nature, it would require even more
iterations than complete enumeration, for some problems, in order to guarantee
convergence to an exact solution. For example, for an n-city TSP, simulated
annealing using the logarithmic cooling schedule needs a computational complexity
of $O(n^{n^{2n-1}})$, which is far more than $O((n-1)!)$ for complete enumeration
and $O(n^2 2^n)$ for dynamic programming [9, 1]. Thus, one has to apply heuristic
fast cooling schedules to improve the convergence speed.

Figure 6.6 Schematics of the landscapes of the energy function with one-dimensional variable x and
different values of gain $\beta$. (a) Low $\beta$ smoothes the surface. (b) High $\beta$ reveals more details in the
surface.

The Hopfield network is desirable for solving COPs that can be formulated into 
quadratic functions. The Hopfield network converges very fast, and it can also 
be easily implemented using RC circuits. However, due to its gradient-descent
nature, it gets trapped at the local minimum nearest to the initial random
state.

To help the Hopfield network escape from the local minima, a popular strategy
is to change the sigmoidal gain $\beta$, by starting from a low gain and gradually
increasing it. When $\beta$ is low, the energy landscape is smooth, and the algorithm
can easily find a good local minimum. As $\beta$ increases, more details of the energy
landscape are revealed, and the algorithm can find a better solution. This is
illustrated in Fig. 6.6. This process is usually called gain annealing, as it is
analogous to the cooling process of simulated annealing. In the limit, when
$\beta \to \infty$, the hyperbolic tangent function becomes the signum function.
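The gain schedule itself is easy to sketch in code. The following minimal example, which assumes a discrete-time network with asynchronous updates and a tanh activation, only illustrates how the gain is raised step by step; the function name, the number of sweeps per gain value, and the schedule are hypothetical choices, and the sketch makes no claim about the quality of the minimum reached on a real problem.

```python
import numpy as np

def gain_annealed_hopfield(w, theta, x0, betas, sweeps=5):
    """Asynchronous updates x_i <- tanh(beta * (w_i^T x + theta_i)) with rising gain beta."""
    w = np.asarray(w, dtype=float)
    theta = np.asarray(theta, dtype=float)
    x = np.array(x0, dtype=float)
    for beta in betas:                    # low gain first, high gain last
        for _ in range(sweeps):
            for i in range(len(x)):
                net = w[:, i] @ x + theta[i]
                x[i] = np.tanh(beta * net)
    return x
```

At the end of the schedule the outputs are close to the values of the signum function, in line with the limiting behavior noted above.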

In order to use the Hopfield network for solving optimization problems, one
can construct an energy function using the Lagrange multiplier method.
By adaptively adjusting the balance between the constraint and objective terms,
the network can avoid falling into a local minimum and continue to update in
a gradient-descent direction of energy [65]. With the learning strategy given in
[65], the minimum found is reported to be always a global or a near-global one.
The method is capable of finding an optimal or near-optimal solution in a short
time for TSP.


Solving other optimization problems 


Tank and Hopfield first proposed the Hopfield network structure to solve the 
linear programming (LP) problem [59]. Their work is extended in [33] to solve
a nonlinear programming problem. A class of Lagrange neural networks appropriate
for general nonlinear programming, i.e., problems including both equality
and inequality constraints, is analyzed in [69].

Matrix inversion can be performed using the Hopfield network [31]. Given a
nonsingular $n \times n$ matrix $A$, the energy function can be defined by $\|AV - I\|_F^2$,
where $V$ denotes the inverse of $A$ and the subscript $F$ denotes the Frobenius
norm. This energy function can be decomposed into n energy functions, and n
similar networks are required, each optimizing an energy function. This method
can be used to solve a system of n linear equations with n variables, $Ax = b$,
where $A \in R^{n \times n}$ and $x, b \in R^n$, if $A$ is nonsingular. In [8], this set of linear
equations is solved by using a continuous Hopfield network with n nodes. The
Hopfield network is designed to minimize the energy function $E = \frac{1}{2} \|Ax - b\|^2$,
and the activation function is selected as a linear transfer function. This method
is also applicable when there exist infinitely many solutions and $A$ is singular.
Another neural LS estimator that uses a continuous Hopfield network and a
nonlinear activation function has been proposed in [22].
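The energy-minimization view of solving linear equations can be imitated in software by integrating the gradient flow of the least-squares energy. The sketch below is a plain Euler simulation under stated assumptions, not the circuit of [8]: the function name, the step size `dt`, and the number of steps are illustrative, and `dt` must be smaller than 2 divided by the largest eigenvalue of A^T A for the iteration to converge.

```python
import numpy as np

def solve_by_gradient_flow(a, b, dt=0.05, steps=2000):
    """Euler simulation of dx/dt = -grad E(x), with E(x) = 0.5 * ||A x - b||^2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros(a.shape[1])              # start from the origin
    for _ in range(steps):
        x -= dt * (a.T @ (a @ x - b))     # the gradient of E is A^T (A x - b)
    return x
```

For a nonsingular system the iterate approaches the unique solution. When infinitely many solutions exist, the same iteration started from the origin settles on the one with minimum norm.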

A Hopfield network with linear transfer functions augmented by an additional 
feedforward layer can be used to solve a set of linear equations [62] and to 
compute the pseudoinverse of a matrix [39]. The resultant augmented linear 
Hopfield network can be used to solve constrained LS optimization problems. 

The LP network [59] is designed based on the Hopfield model for solving LP 
problems 


$$\min \; a^T x \qquad (6.30)$$

subject to

$$d_j^T x \ge h_j, \quad j = 1, \ldots, M, \qquad (6.31)$$

where $d_j = (d_{j,1}, d_{j,2}, \ldots, d_{j,J})^T$, $D = [d_{j,i}]$ is an $M \times J$ matrix, and $h_j$ is a constant.
Each inequality constraint is modeled by a slack neuron. The network
contains a signal plane with J neurons and a constraint plane with M neurons.
The energy function decreases until the network reaches a state where all time
derivatives are zero.

With some modifications, the LP network [59] can be used to solve least- 
squares error problems [67]. In [17], a circuit based on a modification of the LP 
network [59] is designed for computing the discrete Hartley transform. A circuit 
for computing the discrete Fourier transform (DFT) is obtained by simply adding 
a few adders to the discrete Hartley transform circuit. 

Based on the inherent properties of convex quadratic minimax problems, a 
neural network model for a class of convex quadratic minimax problems is pre- 
sented in [23]. The model is stable in the sense of Lyapunov and will converge 
to an exact saddle point in finite time by defining a proper convex energy function.
Furthermore, global exponential stability of the model is shown under mild
conditions.


Chaos and chaotic neural networks


Chaos, bifurcation, and fractals 


If a system seems to behave randomly, it is said to be chaotic or to demon- 
strate chaos. Chaos is a complicated behavior of a nonlinear dynamical system. 
It is also a self-organized process subject to some underlying rules. Chaos is an 
omnipresent phenomenon in biological behavior and evolution. There is evidence 
that parts of the brain, as well as individual neurons, exhibit chaotic behavior. 
Chaos, together with the theory of relativity and quantum mechanics, was con- 
sidered one of the three monumental discoveries of the twentieth century. In fact, 
chaos theory is closely related to Heisenberg’s uncertainty principle. 

Chaotic systems can exhibit either deterministic chaos or nondeterministic chaos.
For deterministic chaos, the system’s behavior can be approximately or exactly 
represented by a mathematically or heuristically expressed function. For non- 
deterministic chaos, the system’s behavior is not expressible by a deterministic 
function and therefore is not at all predictable. 

A dynamic system is a system whose state varies over time. It can be represented
by state equations. A stable system usually has fixed-point equilibria,
called attractors. Strange attractors are a class of attractors that exhibit chaotic
behavior. A chaotic process can be classified according to its fractal dimensions
and Lyapunov exponent. In a typical chaotic system, there exist bifurcation
points that lead to chaos, as well as self-similarity and fractals. Time delays can be
the source of instabilities and bifurcations in dynamical systems and are fre- 
quently observed in biological systems such as neural networks. 

Bifurcation is a common phenomenon found in chaotic systems which indicates 
sudden, qualitative changes in the dynamics of the system either from one kind of 
periodic case (with limit cycle(s) and fixed point(s)) to another kind of periodic 
situation, or from a periodic stage to a chaotic stage. 


Example 6.2: A well-known chaotic function is the one used to model fish population
growth: $x(n+1) = x(n)(1 - x(n))$. We can represent this logistic function
in a slightly modified form as $x(n+1) = r x(n)(1 - x(n))$, for the bifurcation
parameter $r \in [0, 4]$; iteration starts with a randomly selected $x(0)$. Figure 6.7
depicts the bifurcation diagram (x versus r).

The behavior of this first-order difference equation changes dramatically as
r is altered. For r < 1 the output x goes to zero. For 1 < r < 3 the output converges
to a single nonzero value. These are stable regions. When r goes beyond
3, the process begins alternating between two different points without converging
to either; the output begins to oscillate between two values initially, then
between four values, then between eight values, and so on. After r = 3.44948...
the two curves split further into four, and the iteration oscillates between four
different points. The horizontal distance between the split points grows shorter
and shorter, until the bifurcation becomes so fast at the point r ≈ 3.57 that
iterates race all over a segment instead of alternating between a few fixed points.
For r > 3.57, the output becomes chaotic. The behavior is chaotic in the sense
that, because of the sensitive dependence on the initial value, it is practically
impossible to predict where the iterates will appear in the long run.
Figure 6.7a shows the local minima and maxima of the asymptotic time series
against the parameter r. Figure 6.7b shows representative time series with fixed
point at r = 2.9, limit cycle at r = 3.3 and chaotic attractors at r = 3.9.

Figure 6.7 Bifurcation diagram of the logistic equation.
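The regimes described in this example can be reproduced by iterating the logistic map and inspecting the iterates that remain after a transient. The following is a minimal sketch; the function name `logistic_orbit`, the initial value, and the iteration counts are arbitrary choices made for illustration.

```python
def logistic_orbit(r, x0=0.2, n_transient=1000, n_keep=8):
    """Iterate x(n+1) = r x(n) (1 - x(n)) and return the iterates after a transient."""
    x = x0
    for _ in range(n_transient):          # discard the transient
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(n_keep):               # record the asymptotic behavior
        x = r * x * (1.0 - x)
        orbit.append(x)
    return orbit
```

With r = 2.9 all recorded values coincide with the fixed point 1 - 1/r, with r = 3.3 they alternate between two values, and with r = 3.9 they show no apparent repetition.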


The phase space of a chaotic process is the feature space where the process 
is traced over time. A chaotic process goes around areas or points of its phase 
space, but without repeating the same trajectory; such areas, or points, from 
the phase space are called chaotic attractors. From a geometrical point of view, 
all chaotic attractors are fractals, while all fractals have their bifurcations and 
chaotic features. Fractals are a graphical presentation of chaos while chaos is the 
physical dynamics of fractals. Fractal dimension indicates the complexity of a 
chaotic system, for example, a low-dimensional attractor (3 to 4) would suggest 
that the problem is solvable. A chaotic process has a fractal dimension, which is 
defined by its attractors. 


Chaotic neural networks 


Chaotic neural networks with the sigmoidal activation functions were successfully 
implemented for solving various practical problems. Chaotic simulated annealing 
was utilized to solve COPs, such as TSP [9, 11]. It has a much better searching
ability in solving TSP, in comparison with the Hopfield network, the Boltzmann
machine, and the Gaussian machine [49].

A recurrent network such as the Hopfield network, when introduced with 
chaotic dynamics, is sometimes called a chaotic neural network. The chaotic 
dynamics are temporarily generated for searching and self-organizing, and even- 
tually vanish with the autonomous decrease of a bifurcation parameter cor- 
responding to the temperature in the simulated annealing process. Thus, the 
chaotic neural network gradually approaches the dynamical structure of the
recurrent network.

Since a chaotic neural network operates in a manner similar to that of simu- 
lated annealing, but in a deterministically chaotic way, the operation is known as 
chaotic simulated annealing. More specifically, the transiently chaotic dynamics 
are used for searching a basin containing the global optimum, followed by a sta- 
ble and convergent phase when the chaotic noise decreases to zero. As a result, 
the chaotic neural network has a high ability for searching globally optimal or 
near-optimal solutions [9]. 

Simulated annealing, employing the Monte Carlo scheme, searches all the pos- 
sible states by temporally changing the probability distributions. In contrast, 
chaotic simulated annealing searches a possible fractal subspace with continu- 
ous states by temporally changing invariant measures that are determined by its 
dynamics. Thus, the search region in chaotic simulated annealing is very small 
compared with the state space, and chaotic simulated annealing can perform an 
efficient search. 

A small amount of chaotic noise can be injected into the output of the neurons 
and/or to the weights during the operation of the Hopfield network. In [26], a 
chaotic neural network is obtained by adding chaotic noise to each neuron of the 
discrete-time continuous-output Hopfield network and gradually reducing the 
noise so that it is initially chaotic, but eventually convergent. 

A chaotic neural network based on a modified Nagumo-Sato neuron model was 
proposed in [4] in order to explain complex dynamics observed in a biological 
neural system. The chaotic neural network introduced in [9, 4] is obtained by 
adding a negative self-coupling to the Hopfield network. By gradually removing 
the self-coupling, the transient chaos is used for searching and self-organizing. 
The updating rule for the chaotic neural network is given by [9] 


$$\mathrm{net}_i(t+1) = \left( 1 - \frac{\alpha_i}{\tau_i} \right) \mathrm{net}_i(t) + \frac{1}{\tau_i} \left( w_i^T x + \theta_i \right) - c(t) \left( x_i - b \right), \qquad (6.32)$$
$$x_i(t) = \phi \left( \mathrm{net}_i(t) \right), \qquad (6.33)$$


where $c(t+1) = \beta c(t)$, $\beta \in [0, 1]$, the bias $b \ge 0$, and other parameters are the
same as for (6.8). A large initial value of $c(t)$ is used so that self-coupling is
strong enough to generate chaotic dynamics for searching the global minima.
The damping of $c(t)$ produces successive bifurcations so that the neurodynamics
eventually converge from strange attractors to a stable equilibrium point.

It is shown in [49] that the Euler approximation of the continuous-time Hopfield
network with a negative neuronal self-coupling exhibits chaotic dynamics
and that this model is equivalent to a special case of a chaotic network proposed
in [4] after a variable transformation.
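An update rule with a decaying negative self-coupling can be simulated in a few lines. The sketch below assumes a rule of the form net(t+1) = (1 - alpha/tau) net(t) + (W^T x + theta)/tau - c(t)(x - b) with c(t+1) = beta c(t) and a logistic sigmoid as the activation function; the function name and all parameter values are illustrative and are not taken from [9]. Whether a particular parameter set actually produces transient chaos is not checked here.

```python
import numpy as np

def transiently_chaotic_run(w, theta, net0, alpha=1.0, tau=10.0, eps=0.02,
                            c0=0.5, beta=0.98, b=0.5, steps=3000):
    """Iterate a Hopfield-type update with a self-coupling c(t) that decays geometrically."""
    w = np.asarray(w, dtype=float)
    theta = np.asarray(theta, dtype=float)
    net = np.array(net0, dtype=float)
    c = c0
    history = []
    for _ in range(steps):
        x = 0.5 * (1.0 + np.tanh(net / (2.0 * eps)))   # logistic sigmoid, steepness 1/eps
        net = (1.0 - alpha / tau) * net + (w.T @ x + theta) / tau - c * (x - b)
        c *= beta                                      # damp the self-coupling
        history.append(x.copy())
    return np.array(history), net, c
```

Once the self-coupling has decayed to a negligible value, the iteration reduces to the ordinary discrete-time network and settles at an equilibrium; for a single uncoupled neuron this equilibrium is net = theta / alpha.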

The chaotic simulated annealing approach is derived by varying the time step
$\Delta t$ of an Euler discretization of the Hopfield network described by (6.8) [63]


$$\mathrm{net}_i(t + \Delta t) = \mathrm{net}_i(t) \left( 1 - \frac{\alpha_i \Delta t}{\tau_i} \right) + \frac{\Delta t}{\tau_i} \left( w_i^T x + \theta_i \right). \qquad (6.34)$$


The time step is analogous to the temperature parameter in simulated annealing,
and the method starts with a large $\Delta t$, where the dynamics are chaotic, and gradually
decreases it. When $\Delta t \to 0$, the system approaches the Hopfield model (6.8)
and minimizes its energy function. When $\Delta t = 1$, the Euler-discretized Hopfield
network is identical to the chaotic neural network given in [9]. The simulation
results for COPs are comparable to those of the method proposed in [9].

Many chaotic approaches [9, 63, 26] can be unified and compared under the
framework of adding an extra energy term $E_{\mathrm{CSA}}$ into the original computational
energy (6.11) of the Hopfield model [35]. The extra energy term modifies the
original Hopfield energy landscape to accommodate transient chaos. For example,
for the method proposed in [9], $E_{\mathrm{CSA}}$ can be selected as


$$E_{\mathrm{CSA}} = \frac{\lambda}{2} \sum_{i} x_i (x_i - 1). \qquad (6.35)$$


This results in many logistic maps being added to the Hopfield energy function.
$E_{\mathrm{CSA}}$ is convex, and hence drives x toward the interior of the hypercube. This
driving force is diminished as $E_{\mathrm{CSA}} \to 0$ when $\lambda(t) \to 0$.
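To see how such an extra energy term relates to a self-coupling in the update rule, one can differentiate it with respect to a single output. The short derivation below is a sketch that assumes the extra term has the form $E_{\mathrm{CSA}} = \frac{\lambda}{2} \sum_i x_i (x_i - 1)$ with a time-varying weight $\lambda(t)$:

```latex
-\frac{\partial E_{\mathrm{CSA}}}{\partial x_i}
  = -\frac{\lambda}{2} \left( 2 x_i - 1 \right)
  = -\lambda \left( x_i - \tfrac{1}{2} \right).
```

This has the same shape as a self-coupling term of the form $-c(t)(x_i - b)$ with $c(t) = \lambda(t)$ and $b = 1/2$. The force points toward $x_i = 1/2$, the center of the unit interval, and it vanishes as $\lambda(t) \to 0$.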

A theoretical explanation for the global searching ability of the chaotic neural 
network is given in [11]: its attracting set contains all global and local optima of 
the optimization problem under certain conditions, and since the chaotic attract- 
ing set has a fractal structure and covers only a very small fraction of the entire 
state space, chaotic simulated annealing is more efficient in searching for good 
solutions for optimization problems compared to simulated annealing. 

However, a number of network parameters must be subtly adjusted so as to 
guarantee the convergence of the chaotic network. Chaotic simulated annealing is 
not guaranteed to settle down at a global optimum no matter how slowly anneal- 
ing is carried out, because the chaotic dynamics are completely deterministic. 
Stochastic chaotic simulated annealing [64] is a combination of simulated anneal- 
ing and chaotic simulated annealing by using a noisy chaotic network, which is 
obtained by adding decaying stochastic noise into the chaotic network proposed 
in [9]. Stochastic chaotic simulated annealing restricts the random search to a 
subspace of chaotic attracting sets, and this subspace is much smaller than the 
entire state space searched by simulated annealing. 

The Wang-oscillator, or Wang’s chaotic neural oscillator [66], is a neural oscillatory
model consisting of two neurons, one excitatory and one inhibitory. It
encodes the input stimulus and gives the responses by altering the behavior
of the neural dynamics. Inspired by the Wang-oscillator, the Lee-oscillator [37]
provides a transient chaotic progressive growth in its neural dynamics, which
solves the fundamental shortcoming of the Wang-oscillator. The main purpose
of Aihara’s chaotic network is to realize dynamical pattern association, so
it does not converge to any particular pattern, but rather oscillates among
different stored patterns in a chaotic manner. By using Lee-oscillators for pattern
association, a chaotic auto-associative network, namely the Lee-associator, is
constructed. The Lee-oscillator provides gradual and progressive changes of neural
dynamics in the transition region, whereas Aihara’s chaotic network provides an
abrupt change in the neural dynamics in this region. Compared with the chaotic
auto-associative networks developed in [3] and [66], the Lee-associator produces a
robust progressive memory recalling scheme during chaotic memory association.
This is reported to be consistent with research in psychiatry and
perception psychology on dynamic memory recalling schemes, as well as the
implications and analogues to human perception as illustrated by the
Rubin-vase experiment on visual psychology.


Multistate Hopfield networks 


The multilevel Hopfield network [21, 68, 54, 70] and the complex-valued mul- 
tistate Hopfield network [32, 47] are two direct generalizations of the Hopfield 
network. The multilevel Hopfield network uses neurons with an increasing multi- 
step function as the activation function [21], while the complex-valued multistate 
Hopfield network uses a multivalued complex-signum function as the activation 
function. 

The multilevel sigmoidal function has typically been used as the activation 
function in the multilevel Hopfield network. In [68], a multilevel Hopfield-like 
network is obtained by using a new neuron with self-feedback and the multilevel 
sigmoidal activation function. The multilevel model has been applied for A/D 
conversion, and a circuit implementation for the neural A/D converter has been 
fabricated [68]. For an activation function of N levels, sufficient conditions are
given in [42] ensuring that the networks have 2N + 1 equilibria with N + 1
of them as locally stable points, as well as some criteria guaranteeing global
and complete stability of the Hopfield networks with multilevel activation functions
by using the continuity property of neuron state variables and a Lyapunov
functional.

The use of multistate neurons leads to a network architecture that is significantly
smaller than that of the Hopfield network, and hence to a simpler hardware
implementation. The reduction in the network size is highly desirable in large-scale
applications such as image restoration and TSP. In addition, the complex-valued
multistate Hopfield network is also more efficient and convenient than
the Hopfield network in the manipulation of complex-valued signals. Storage
capacity can be improved by using complex-valued multistate Hopfield networks
[32, 47], or by replacing bilevel activation functions with multilevel ones [70, 42].

The complex-valued multistate Hopfield network [32, 47] employs the multivalued
complex-signum activation function, which is defined as an L-stage phase
quantizer for complex numbers

\[
\operatorname{csign}(u) =
\begin{cases}
z^{0}, & \arg(u) \in [0, \varphi_0) \\
z^{1}, & \arg(u) \in [\varphi_0, 2\varphi_0) \\
\vdots & \\
z^{L-1}, & \arg(u) \in [(L-1)\varphi_0, L\varphi_0)
\end{cases}
\tag{6.36}
\]

where $z = e^{j\varphi_0}$ with $\varphi_0 = 2\pi/L$ is the Lth root of unity. Each state takes one of
the L equally spaced points on the unit circle of the complex plane.
Similar to the Hopfield network, the network dynamics are defined by

\[
net_i(t) = \sum_{k=1}^{J} w_{ki} x_k(t), \quad i = 1, \ldots, J, \tag{6.37}
\]

\[
x_i(t+1) = \operatorname{csign}\left( net_i(t) \cdot z^{1/2} \right), \tag{6.38}
\]

where J is the number of neurons, and the factor $z^{1/2} = e^{j\varphi_0/2}$ places the resulting
states in the angular centers of each sector. A sufficient condition for the
stability of the dynamics is that the weight matrix is Hermitian with non-negative
diagonal entries, that is, $W = W^H$, $w_{ii} \geq 0$ [32]. The energy can be defined as

\[
E(x) = -\frac{1}{2} x^H W x. \tag{6.39}
\]
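The phase quantizer and one update step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `csign` and `update`, the synchronous update of all neurons, and the outer-product weight matrix used to store a single pattern are assumptions made for the example.

```python
import cmath
import math

def csign(u, L):
    """L-stage phase quantizer: return z**k when arg(u) lies in [k*phi0, (k+1)*phi0)."""
    phi0 = 2 * math.pi / L
    angle = cmath.phase(u) % (2 * math.pi)   # fold arg(u) into [0, 2*pi)
    k = int(angle // phi0) % L               # sector index; % L guards rounding at 2*pi
    return cmath.exp(1j * phi0 * k)

def update(W, x, L):
    """One step of the dynamics; W[k][i] is the weight from neuron k to neuron i.
    All neurons are updated at once (an assumption made for brevity)."""
    J = len(x)
    half = cmath.exp(1j * math.pi / L)       # z**(1/2), rotates by half a sector
    return [csign(sum(W[k][i] * x[k] for k in range(J)) * half, L) for i in range(J)]

# Illustrative weights (assumed): Hermitian outer product storing one pattern.
L = 4
z = cmath.exp(2j * math.pi / L)
pattern = [z**0, z**1, z**3, z**2, z**1]
J = len(pattern)
W = [[pattern[i] * pattern[k].conjugate() / J for i in range(J)] for k in range(J)]

recalled = update(W, pattern, L)             # the stored pattern should be a fixed point
```

With these weights the net input of neuron i equals its stored state, which lies exactly on a sector boundary; the half-sector rotation moves it to the center of the sector before quantization, so the state is reproduced.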


Cellular neural networks 


A cellular network is a two- or higher-dimensional array of cells, with only local 
interactions that can be programmed by a template matrix [13]. It is made of a 
massive aggregate of regularly spaced circuit clones, called cells, which
communicate with one another directly only through their nearest neighbors. Each cell is
made of a linear capacitor, a nonlinear voltage-controlled current source, and a 
few resistive linear circuit elements. Thus, each cell has its own dynamics whose 
evolution is dependent on its circuit time constant T = RC. 

The cellular network is a generalization of the Hopfield network, and can be 
used to solve a more generalized optimization problem. It overcomes the massive 
interconnection problem of parallel distributed processing. The key features are 
asynchronous parallel processing, continuous time dynamics, local interactions 
among the network elements, and VLSI implementation. 
Due to its local interconnectivity property, cellular network chips can have
high-density cells, and some physical implementations, such as analog CMOS,
emulated digital CMOS and optical implementations, are available. The cellular
network universal machine [53] is the analog cellular computer for processing
analog array signals. It has a computational power of tera ($10^{12}$) or peta ($10^{15}$)
analog operations per second on a single CMOS chip [14].

The cellular network with a two-dimensional array architecture is a natural 
candidate for image processing or simulation of partial differential equations. 
Any input-output function to be realized by the cellular network can also be 
visualized as an image-processing task, where the external input, the initial con- 
dition and the output, arranged as two-dimensional arrays, are, respectively, the 
external input, initial and output images. The external input image together 
with the initial image constitutes the input images of the cellular network. Using 
different cloning templates, namely, the representation of the local interconnec- 
tion patterns, different operations can be conducted on an image. The cellular 
network has become an important method for image processing, and has been 
applied to many signal processing applications such as image compression, fil- 
tering, learning and pattern recognition. 

The image processing tasks using cellular networks were mainly developed 
for black and white output images, since a cell in the cellular network has a 
stable equilibrium point at the two saturation regions of the piecewise linear 
output function after the transient has decayed toward equilibrium. Due to the 
cellular network characteristics, each pixel of an image corresponds to a cell of 
a cellular network. In contrast, the nonlinear dynamics of a cellular network 
with a two-level output function converts the input images into bilevel pulse 
digital sequences. The cellular network can convert a multi-bit image into an
optimal binary halftone image. This significant characteristic of a cellular network
suggests the possibility of a spatial domain sigma-delta modulation [5]. The pro- 
posed system can be treated as a very large-scale and super-parallel sigma-delta 
modulator. 


6.1 Form the energy function of the Hopfield network, and show that if $x^*$ is a
local minimum of E, then $-x^*$ is also a local minimum of E. This explains the
reason for negative fundamental memories.


6.2 Lyapunov's second theorem is usually used to verify the stability of a linear
system $\dot{x} = Ax$. The Lyapunov function can be selected as $E(x) = x^T Q x$, where
Q can be obtained from $A^T Q + QA = -I$, I being the identity matrix. Verify


the stability of the linear system with A = p l ; 
6.3 For the assignment problem given by (6.23) to (6.26), assume m = n, $a_{ij} = 1$,
$b_i = 1$. The energy function can be constructed as


i=1 \g=l j=l \i=1 
n n n n 
+ 5 5 Lij (1 = Liz) + ae t/t 5 5 Cig Liz (t); 
i=1 j=1 i=1 j=1 


where $\alpha$ and $\tau$ are positive constants, and $\alpha e^{-t/\tau}$ is an annealing factor that
gradually balances the cost and the constraints.
(a) Design a Hopfield network to solve this problem. 


(b) What is the optimal solution? 


6.4 Write a program to solve a TSP of N cities using a Hopfield network. The 
objective is to minimize the total length of a tour, Lp, where P is a permutation 
of the N cities. 


6.5 Use simulated annealing to solve TSP of N cities. 


6.6 A typical chaotic function is the Mackey-Glass chaotic time series, which is
generated from the following delay differential equation:


\[
\frac{dx(t)}{dt} = \frac{0.2\, x(t - D)}{1 + x^{10}(t - D)} - 0.1\, x(t),
\]


where D is a delay. 

(a) Plot the time series for different D values. 

(b) Plot the bifurcation diagram. 

(c) Verify that for D > 17 the function shows a chaotic behavior. 


6.7 For Aihara’s chaotic neural network, the neural dynamics of a single neuron 
is given by [10]: 
1 


s(t +1) = —— __, 
1 + exp (-: [ken 2 +wa(t) — wao +1]) 


where x(t) is the output of the neural oscillator, $\varepsilon$ is the steepness parameter, k is
the damping factor, and I is the external stimulus. By using $\varepsilon = 1/250$, k = 0.9,
$a_0 = 0.5$, and w = -0.07 [10], plot the bifurcation diagram (x versus I) of Aihara's
chaotic neural network. [Hint: Refer to [37].]


6.8 Solve TSP using: 
(a) The Hopfield network. 
(b) Chen and Aihara’s method [9]. [Hint: Refer to [64]]. 


6.9 A 3-bit A/D converter can convert a continuous analog signal $y(t) \in [0, 7]$
into a 3-bit representation $(x_2, x_1, x_0)$, where $x_0$ is the least significant bit. The
quadratic objective function can be given by

\[
J(x_0, x_1, x_2) = \frac{1}{2} \left( y - \sum_{i=0}^{2} 2^i x_i \right)^2
\]

subject to $x_i \in \{0, 1\}$, or $x_i (1 - x_i) = 0$, i = 0, 1, 2.

We need to define a Lagrangian that matches the energy function of the Hopfield
network. This can be achieved by setting the values of the Lagrange multipliers
$\lambda_i$, i = 0, 1, 2, and omitting constants that are not associated with the $x_i$'s.

(a) Solve for the weights and inputs of a Hopfield network. 
(b) Give the network dynamics for solving this problem. 


6.10 Generalize the design in Problem 6.9 to an n-bit A/D converter. Show
that the weights and inputs can be set by $w_{ij} = -2^{i+j}$ when $i \neq j$, or 0 when i = j,
and $I_i = -2^{2i-1} + 2^i x$, i, j = 0, 1, ..., n - 1. According to Tank and Hopfield
[59], operation of this Hopfield A/D converter for every new analog input x
requires reinitialization of all states to zero.


6.11 A saddle node is an illustrative bifurcation. For $\dot{x} = a + x^2$, investigate
the fixed points for a < 0, a = 0, and a > 0. Plot x versus a, using solid lines to
denote fixed-point attractors and dashed lines for fixed-point repellors.
Introduction


The human brain stores information in synapses or in reverberating loops of
electrical activity. Most existing associative memory models store information
in synapses. However, loop-based algorithms can learn complex control tasks
faster, with exponentially fewer neurons, and avoid the problem of weight
transport, but with long feedback delays [28]. They explain aspects of consolidation,
the role of attention, and relapses.

Association is a salient feature of human memory. The brain recalls by associ- 
ation, that is, the brain associates the recalled item with a piece of information 
or with another item. Associative memory models, also known as content-addressable
memories, are well analyzed. A memory is a system with three functions or
stages: recording (storing the information), preservation (keeping the
information safely), and recall (retrieving the information). A pattern can be stored in
memory through a learning process. For an imperfect input pattern, associative 
memory has the capability to recall the stored pattern correctly by performing 
a collective relaxation search. Associative memories can be either heteroasso- 
ciative or autoassociative. For heteroassociation, the input and output vectors 
range over different vector spaces, while for autoassociation, both the input and 
output vectors range over the same vector space. Neural associative memories 
have applications in different fields, such as image processing, pattern recognition 
and optimization. 

Episodic memory allows one to remember one's own experiences in an explicit
and conscious manner. Episodic memory is crucial in supporting many cogni- 
tive capabilities, including concept formation, representation of events in spatio- 
temporal dimension, and record of progress in goal processing [20]. Two basic 
elements of episodic memory are events and episodes. An event can be described 
as a snapshot of experience. Usually, a remembered event can be used to answer 
critical questions such as what, where, and when. An episode can be consid- 
ered as a temporal sequence of events. Three major tasks in episodic memory 
retrieval are event detection, episode recognition, and episode recall. Forgetting 
should exist in memory to avoid information overflow. 

Recognition memory is involved with two types of retrieval processes: famil- 
iarity and recollection. When presented with an item, one might have a sense of 
recognition but cannot recall the detail of the stimulus encountered before. This
is called familiarity memory. Familiarity distinguishes whether the stimulus was 
previously encountered. The medial temporal lobe and the prefrontal cortex play 
a critical role in familiarity memory [23]. Recollection retrieves detailed informa- 
tion about an experienced event. Familiarity capacity is typically proportional 
to the number of synapses within the network [11], whereas the capacity for rec- 
ollection is typically proportional to the square root of the number of synapses, 
that is, the number of neurons in a fully connected network [6]. Mean field
analysis indicates that the capacity of familiarity discriminators based
on a neural network model is larger than the recollection capacity [23].

Research on neural associative memories originated in the 1950s with matrix 
associative memories [76]. In 1972, the linear associative memory was intro- 
duced, independently, by several authors, where correlation or Hebbian learning 
is used to synthesize the synaptic weight matrix [8], [42], [3]. The brain-state-in- 
a-box (BSB) network is a discrete-time nonlinear dynamical system as a mem- 
ory model based on neurophysiological considerations. The Hopfield model [35] 
is a continuous-time, continuous-state dynamic associative memory model. The 
binary Hopfield network is a well-known model for nonlinear associative memo- 
ries. It can retrieve a pattern stored in memory in response to the presentation 
of a corrupted version of the pattern. This is done by mapping a fundamental
memory x onto a stable point of a dynamical system. Kosko [44] extended
the Hopfield associative memory to bidirectional associative memory (BAM) by 
incorporating an additional layer to perform recurrent autoassociation or het- 
eroassociation. 

Linear associative memories, BSB [9], and BAM [44, 10] can be used as both 
autoassociative and heteroassociative memories, while the Hopfield model, the 
Hamming network [50], and the Boltzmann machine can only be used as autoas- 
sociative models. Perfect recall can be guaranteed by imposing an orthogonality 
condition on the stored patterns. The optimal linear associative memory, which 
employs the projection recording recipe, is not subject to this constraint [43]. 
The optimal linear associative memory, although it exhibits a better storage
capacity than the linear associative memory model, has low noise tolerance.

An autoassociator is a brain-like distributed network that learns from the 
samples in a category to reproduce each sample at the output with a mapping. An 
autoassociator learns normal examples; when an unknown pattern is presented, 
the reconstruction error by the autoassociator will be compared with a threshold 
to signal whether it is a novel pattern (with larger error) or a normal pattern 
(with smaller error). This classification methodology has been applied to various 
detection problems such as face detection, network security, and natural language 
grammar learning. 

Memory is important for transforming a static network into a dynamic one. 
Memories can be long-term or short-term. A long-term memory is used to store 
stable system information, while a short-term memory is useful for simulating a 
dynamic system with a temporal dimension. For a Hopfield network, the states 
of the neurons can be considered as short-term memories while the synaptic
weights can be treated as long-term memories. Feedforward networks can become 
dynamic by embedding memory into the network using time delay. Recurrent 
network models such as the Hopfield model and the Boltzmann machine are 
popular associative memory models. 

Other associative memory models with unlimited storage capacity include the
morphological associative memory [66], [67]. Morphological associative memories
employ the min and max operations that are used in mathematical morphology.
The morphological models are very efficient in recalling patterns corrupted
with either additive or subtractive noise.


Hopfield model: storage and retrieval 


Operation of the Hopfield network as an associative memory includes two phases: 
storage and retrieval. Bipolar coding is often used for associative memory in 
that bipolar vectors have a greater probability of being orthogonal than binary 
vectors. We use bipolar coding in this chapter. 

We now store in the network a set of N bipolar patterns, $\{x_p\}$, where
$x_p = (x_{p,1}, x_{p,2}, \ldots, x_{p,J})^T$, $x_{p,i} = \pm 1$. These patterns are called fundamental
memories. Storage is implemented by using a learning algorithm, while retrieval
is based on the dynamics of the network.


Generalized Hebbian rule 


Conventional algorithms for associative storage are typically local algorithms
based on the Hebbian rule. The Hebbian rule is known as the outer product rule
of storage in connection with associative learning. Using this method, $\theta$ is chosen
as the zero vector. A generalized Hebbian rule for training the Hopfield network
is defined by [35]

\[
w_{ij} = \frac{1}{J} \sum_{p=1}^{N} x_{p,i} x_{p,j}, \quad \text{for all } i \neq j, \tag{7.1}
\]

and $w_{ii} = 0$. In matrix form

\[
W = \frac{1}{J} \left( \sum_{p=1}^{N} x_p x_p^T - N I_J \right), \tag{7.2}
\]

where $I_J$ denotes the J x J identity matrix.
The generalized Hebbian rule can be written in an incremental form

\[
w_{ij}(t) = w_{ij}(t-1) + \eta\, x_{t,i} x_{t,j}, \quad \text{for all } i \neq j, \tag{7.3}
\]

where the step size $\eta = 1/J$, t = 1, ..., N, and $w_{ij}(0) = 0$. As such, learning is
completed after each pattern $x_t$ in the pattern set is presented exactly once.
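The incremental form translates almost directly into code. The following is a small sketch under stated assumptions: the function name `hebbian_weights` and the two example patterns are illustrative, and patterns are plain lists of +1/-1 values.

```python
def hebbian_weights(patterns):
    """Outer-product storage: w_ij = (1/J) * sum_p x_pi * x_pj for i != j, w_ii = 0."""
    J = len(patterns[0])
    W = [[0.0] * J for _ in range(J)]
    for x in patterns:                        # each pattern is presented exactly once
        for i in range(J):
            for j in range(J):
                if i != j:                    # keep the diagonal at zero
                    W[i][j] += x[i] * x[j] / J   # step size eta = 1/J
    return W

# Two illustrative bipolar patterns of length J = 8 (they happen to be orthogonal).
patterns = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
W = hebbian_weights(patterns)
```

Because every update touches only the two neurons a weight connects and needs only the current pattern, the rule is local and incremental, and a new pattern can be added later without revisiting the old ones.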
The generalized Hebbian rule is both local and incremental. It has an absolute
storage capability of $N_{max} = \frac{J}{2 \ln J}$ [56]. The storage capability of an associative
memory network is defined by the maximum number of fundamental memories,
$N_{max}$, that can be stored and retrieved reliably. For reliable retrieval, $N_{max}$
drops to approximately $\frac{J}{4 \ln J}$ [56]. The generalized Hebbian rule, however,
suffers severe degradation, and $N_{max}$ decreases significantly, if the training patterns
are correlated. For example, time series usually include significant correlations
in the measurements of adjacent samples. Some variants of the Hebbian rule,
such as the weighted Hebbian rule [3] and the Hebbian rule with decay [43], can
increase the storage capability.

When training associative memory networks using classical Hebbian learning, 
an additional term called crosstalk may arise. When crosstalk becomes too large, 
spurious states other than the negative stored patterns appear [68]. The number 
of negative stored patterns is always equivalent to the number of stored patterns. 
Hebbian learning produces good results when the stored patterns are nearly 
orthogonal. This is the case when N bipolar vectors are randomly selected from
$R^J$, and N < J. In practice, patterns are usually correlated and the incurred
crosstalk may reduce the capacity of the network. The storage capability of the 
network is expected to decrease if the Hamming distance between the fundamen- 
tal memories becomes smaller. 

An improved Hebbian rule is given by the local and incremental learning rule
[71, 72]

\[
w_{ij}(t) = w_{ij}(t-1) + \eta \left[ x_{t,i} x_{t,j} - h_{ji}(t) x_{t,i} - h_{ij}(t) x_{t,j} \right], \tag{7.4}
\]

\[
h_{ij}(t) = \sum_{u=1,\, u \neq i,j}^{J} w_{iu}(t-1)\, x_{t,u}, \tag{7.5}
\]

where $\eta = 1/J$, t = 1, 2, ..., N, $w_{ij}(0) = 0$ for all i and j, and $h_{ij}$ is a form of local
field at neuron i. This rule has an absolute capacity of $\frac{J}{\sqrt{2 \ln J}}$ for uncorrelated
patterns. It also performs better than the generalized Hebbian rule for correlated
patterns [71]. It does not suffer significant capacity loss when patterns with
medium correlation are stored.
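A compact sketch of the improved rule follows. It is illustrative only: the function name is hypothetical, and keeping the diagonal at zero is an assumption made to match the zero-diagonal Hopfield weight matrix (the diagonal never enters the local fields, so this does not affect the off-diagonal weights).

```python
def improved_hebbian_weights(patterns):
    """Incremental rule (7.4)-(7.5) with eta = 1/J and w_ij(0) = 0."""
    J = len(patterns[0])
    eta = 1.0 / J
    W = [[0.0] * J for _ in range(J)]
    for x in patterns:
        # Local fields from the previous weights: h[i][j] = sum over u != i, j of w_iu * x_u
        h = [[sum(W[i][u] * x[u] for u in range(J) if u != i and u != j)
              for j in range(J)] for i in range(J)]
        W = [[0.0 if i == j else
              W[i][j] + eta * (x[i] * x[j] - h[j][i] * x[i] - h[i][j] * x[j])
              for j in range(J)] for i in range(J)]
    return W

patterns = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
W_first = improved_hebbian_weights(patterns[:1])   # first pattern only
W = improved_hebbian_weights(patterns)
```

For the first pattern all local fields are zero, so the update reduces to the plain Hebbian term; the correction terms only come into play once earlier patterns have shaped the weights.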


Pseudoinverse rule 


The pseudoinverse solution aims at minimizing the crosstalk between the
stored patterns. The pseudoinverse rule uses the pseudoinverse of the pattern 
matrix, while classical Hebbian learning uses the correlation matrix of the pat- 
terns [65, 40]. 

Denoting $X = [x_1, x_2, \ldots, x_N]$, the autoassociative memory is defined as

\[
X^T W = X^T. \tag{7.6}
\]

Using the pseudoinverse, we actually minimize $E = \| X^T W - X^T \|_F$, thus
minimizing the crosstalk in the associative network. The pseudoinverse solution for the
weight matrix is given by

\[
W = (X^T)^{\dagger} X^T. \tag{7.7}
\]


The pseudoinverse rule, also called the projection learning rule, is neither incre- 
mental nor local. It involves inverting an N x N matrix, thus training is very 
slow and impractical. 

The pseudoinverse solution performs better than Hebbian learning when the 
patterns are correlated. Both the Hebbian and pseudoinverse rules are
general-purpose methods for training associative memory networks that can be
represented as $X^T W = \tilde{X}^T$, where $X$ and $\tilde{X}$ are, respectively, the stored and
associated pattern matrices. For an autoassociated pattern $x_i$, the weights generated
from Hebbian learning project the whole input space into the linear subspace
spanned by $x_i$. The projection, however, is not orthogonal. Instead, the
pseudoinverse solution provides orthogonal projection to the linear subspace spanned
by the stored patterns [68]. Theoretically, for N < J and uncorrelated patterns, 
the pseudoinverse solution has a zero error, and the storage capability in this case 
is $N_{max} = J - 1$ [40, 68]. It is shown in [72] that the Hebbian rule is the
zeroth-order expansion of the pseudoinverse rule, and the improved Hebbian rule given
by (7.4) and (7.5) is one form of the first-order expansion of the pseudoinverse 
rule. 
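A short worked check of the projection property may help here. It assumes, beyond what is stated above, that the N stored patterns are linearly independent, so that $X^T X$ is invertible and the pseudoinverse has a closed form:

```latex
\begin{align*}
(X^T)^{\dagger} &= X (X^T X)^{-1}, \\
W &= (X^T)^{\dagger} X^T = X (X^T X)^{-1} X^T, \\
X^T W &= X^T X (X^T X)^{-1} X^T = X^T, \\
W^2 &= X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = W, \qquad W^T = W.
\end{align*}
```

The third line shows that the storage condition is met with zero error, and the last line shows that W is symmetric and idempotent, that is, an orthogonal projector onto the subspace spanned by the stored patterns.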

The pseudoinverse rule is also adapted to storing sequences of prototypes,
where an input $x_i$ leads to an output $x_{i+1}$. The MLP with BP can be used to
compute the pseudoinverse solution when the dimension J is large, since direct
methods to solve for the pseudoinverse would exhaust the memory and the convergence
time would be intolerably long [40].


Perceptron-type learning rule 


The rules addressed above are one-shot methods, in which the network training 
is completed in a single epoch. A learning problem in a Hopfield network with J
units can be transformed into a learning problem for a perceptron of dimension
$J(J+1)/2$ [68]. This equivalence between Hopfield networks and perceptrons leads to
the conclusion that every learning algorithm for perceptrons can be transformed 
into a learning algorithm for Hopfield networks. 

Perceptron learning algorithms for storing bipolar patterns in Hopfield net- 
works have been discussed in [40]. They are simple, online, local algorithms. 
Unlike Hebbian rule-based algorithms, perceptron learning-based algorithms 
work over multiple epochs and often reduce the error nonmonotonically over 
the epochs. The perceptron-type learning rule is given by [38, 40]

\[
w_{ji}(t) = w_{ji}(t-1) + \eta \left[ x_{t,j} x_{t,i} - \frac{1}{2} \left( y_{t,j} x_{t,i} + y_{t,i} x_{t,j} \right) \right], \tag{7.8}
\]

where $y_t = \mathrm{sgn}(W x_t)$, t = 1, ..., N, the learning rate $\eta > 0$, and the $w_{ji}(0)$'s are
small random numbers or zero. In this chapter, the signum function sgn(x) is
defined as 1 for $x \geq 0$ and -1 for x < 0. If $w_{ji}(0) = 0$ is selected for all i, j, $\eta$ can
be any positive number; otherwise, $\eta$ can be selected as a number
of the same order of magnitude as, or larger than, the weights. This accelerates the
convergence process. Notice that $w_{ji}(t) = w_{ij}(t)$.
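The rule can be sketched as an online training loop. This is a hedged illustration rather than the procedure of [38, 40]: the function name, the zero diagonal, the zero initial weights, and the stopping test (stop once every pattern is a fixed point, or after `max_epochs`) are assumptions chosen for the example.

```python
def sgn(v):
    """Signum with sgn(0) = 1."""
    return 1 if v >= 0 else -1

def train_perceptron_type(patterns, eta=0.2, max_epochs=50):
    """Online perceptron-type storage; returns (W, converged)."""
    J = len(patterns[0])
    W = [[0.0] * J for _ in range(J)]         # W(0) = 0 (assumed initialization)
    for _ in range(max_epochs):
        all_stable = True
        for x in patterns:
            # Current response of the network to the pattern being stored
            y = [sgn(sum(W[i][j] * x[j] for j in range(J))) for i in range(J)]
            if y == list(x):
                continue                       # pattern already stable: update is zero
            all_stable = False
            for i in range(J):
                for j in range(J):
                    if i != j:                 # zero diagonal kept (assumption)
                        W[i][j] += eta * (x[i] * x[j]
                                          - 0.5 * (y[i] * x[j] + y[j] * x[i]))
        if all_stable:
            return W, True
    return W, False

patterns = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
W, converged = train_perceptron_type(patterns)
W_single, converged_single = train_perceptron_type(patterns[:1])
```

When the response y already equals the pattern x, the bracketed term vanishes, so only unstable bits change the weights. The update is symmetric in i and j, which keeps W symmetric throughout training.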

However, when the signum vector is not realizable, the perceptron-type rule 
does not converge but oscillates indefinitely. The perceptron-type rule can be 
viewed as a supervised extension of the Hebbian rule by incorporating a term 
for correcting unstable bits. For a recurrent network, the storage capability of 
the perceptron-type algorithm can reach the upper bound $N_{max} = J$ for
uncorrelated patterns.

An extensive experimental comparison between a perceptron-type learning 
rule [40] and the generalized Hebbian rule has been made in [38] on a wide range 
of conditions on the library patterns: the number of patterns N, the pattern 
density p, and the amount of correlation of the bits in a pattern, decided by block 
size B. In terms of stability of the library patterns and error-correction ability 
during the recall phase, the perceptron-type rule is found to be perfect in ensuring 
stability of the stored library patterns under all the evaluated conditions, while 
the generalized Hebbian rule degrades rapidly as N is increased, or p is decreased, 
or B is increased. In many cases, the perceptron-type rule works much better 
than the generalized Hebbian rule in correcting pattern errors. 


Retrieval stage 


After the bipolar words have been stored, the network can be used for infor- 
mation retrieval. When a J-dimensional vector (bipolar word) x, representing 
a corrupted or incomplete memory of the network, is presented to the network 
as its state, information retrieval is performed automatically according to the 
network dynamics given by (6.1) and (6.2), or (6.3) and (6.4). For hard-limiting 
activation function, the discrete form of the network dynamics can be written as 


\[
x_i(t+1) = \mathrm{sgn}\left( \sum_{j=1}^{J} w_{ij} x_j(t) + \theta_i \right), \quad i = 1, 2, \ldots, J, \tag{7.9}
\]

or in matrix form

\[
x(t+1) = \mathrm{sgn}\left( W x(t) + \theta \right), \tag{7.10}
\]

where x(0) is the input corrupted memory, and x(t) represents the retrieved
memory at time t. The retrieval process continues until the state vector x remains
unchanged. The convergent x is a fixed point or the retrieved memory.
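The retrieval loop can be sketched as follows. The example is a minimal illustration: the function names, the synchronous update of all neurons, the zero bias, the iteration cap of 30, and the single stored pattern with outer-product weights are assumptions made for the demonstration.

```python
def sgn(v):
    """Signum with sgn(0) = 1."""
    return 1 if v >= 0 else -1

def retrieve(W, x0, theta=None, max_iters=30):
    """Iterate x(t+1) = sgn(W x(t) + theta) until the state stops changing."""
    J = len(x0)
    theta = [0.0] * J if theta is None else theta
    x = list(x0)
    for _ in range(max_iters):
        x_next = [sgn(sum(W[i][j] * x[j] for j in range(J)) + theta[i])
                  for i in range(J)]
        if x_next == x:                        # fixed point reached
            break
        x = x_next
    return x

# Store one bipolar pattern with outer-product weights (zero diagonal).
pattern = [1, -1, 1, -1, 1, -1, 1, -1]
J = len(pattern)
W = [[0.0 if i == j else pattern[i] * pattern[j] / J for j in range(J)]
     for i in range(J)]

corrupted = [-pattern[0]] + pattern[1:]        # flip the first bit
recalled = retrieve(W, corrupted)
```

In this small case a single flipped bit is corrected in one iteration, and the next iteration confirms that the state no longer changes.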

Models such as the complex-valued Hopfield network are not as tolerant with 
respect to incomplete patterns and salt/pepper noise. However, they perform 
better in the presence of Gaussian noise. An essential feature of the noise acting 
on a pattern is its local nature. If a pattern is split into enough sub-patterns,
a few of them will be more or less affected by noise, while the others will remain intact.
A simple but effective methodology exploits this fact for efficient restoration of 
a pattern [24]. A pattern is restored if enough of its sub-patterns are restored. 
Since several patterns can share the same sub-patterns, the final decision is 
accomplished by means of a voting mechanism. Before deciding if a sub-pattern 
belongs to a pattern, sub-pattern restoration in the presence of noise is done by 
an associative memory. 


Storage capability of the Hopfield model 


In Section 7.2, the storage capability for each of the four storage algorithms 
is given. In practice, there are some upper bounds on the storage capability of 
general recurrent networks. An upper bound on the storage capability of a class of 
recurrent networks with zero-diagonal weight matrix is derived deterministically 
in [1]. 


Theorem 7.1 (Upper bound [1]). For any subset of N binary J-vectors, in
order to find a corresponding zero-diagonal weight matrix W and a bias vector
$\theta$ such that these vectors are fixed points of the network

\[
x_i = \mathrm{sgn}(W x_i + \theta), \quad i = 1, 2, \ldots, N, \tag{7.11}
\]

one needs to have $N \leq J$.


Thus, the upper bound on the storage capability is Nmax = J. This bound 
is valid for any learning algorithm for recurrent networks with a zero-diagonal 
weight matrix. The Hopfield network, having a symmetric zero-diagonal weight 
matrix, is one such network, and as a result, the Hopfield network can at most 
stably store J patterns. 

The upper bound introduced in Theorem 7.1 is too tight, since it requires that 
all the N-tuple subsets of bipolar J-vectors are retrievable. It is also noted in 
[77] that any two patterns differing in precisely one component cannot be jointly 
stored as stable states in the Hopfield network. 
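The remark about patterns that differ in exactly one component follows directly from the zero diagonal. A brief sketch, using the fixed-point condition of Theorem 7.1 and the hypothetical names $x$ and $x'$ for the two patterns:

```latex
\begin{align*}
&\text{Let } x'_i = -x_i \text{ and } x'_j = x_j \text{ for all } j \neq i. \\
&\text{Since } w_{ii} = 0: \quad
  \sum_{j \neq i} w_{ij} x_j + \theta_i
  \;=\; \sum_{j \neq i} w_{ij} x'_j + \theta_i \;=\; s. \\
&\text{Both being fixed points would require } x_i = \mathrm{sgn}(s) \text{ and } x'_i = \mathrm{sgn}(s),
  \text{ contradicting } x'_i = -x_i.
\end{align*}
```

Neuron i receives the same net input in both states, so it can reproduce at most one of the two values.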

An in-depth analysis of the Hopfield model's storage capacity has been carried out
in [5], relying on a mean-field approach and on replica methods originally
developed for spin-glass models. Hopfield networks, when trained with the Hebbian
learning rule, are unlikely to store more than 0.14J uncorrelated random patterns.

A better way of storing patterns is given by an iterative version of the Hebbian
rule [29]. At each learning iteration, the stability of every nominal pattern $x_p$
is tested. Whenever one pattern has not yet reached stability, the responsible
neuron i reinforces its connectivity by adding a Hebbian term to all the synaptic
connections impinging on it.

All patterns to be learned are repeatedly tested for stability, and once all are
stable, learning is complete. This learning algorithm is incremental, since
new information can be learned while preserving all information that has already
been learned. By using this procedure, the capacity can be increased up to 2J
uncorrelated random patterns [29].

When a small fraction $\varepsilon$ of a set of N bipolar J-vectors is permitted to be
irretrievable, the upper bound approaches 2J as $J \to \infty$. This is given by a theorem
derived from the function counting theorem (Theorem 2.2) [30, 77, 40].


Theorem 7.2 (Asymptotical upper bound). For N prototype vectors in
general position, the storage capacity $N_{max}$ can approach 2J, in the sense that,
for any $\varepsilon > 0$, the probability of retrieving a fraction $(1 - \varepsilon)$ of any set of 2J
vectors tends to unity when $J \to \infty$.


N prototype vectors in general position means that any subset of up to N 
vectors is linearly independent. Theorem 7.2 is more general than Theorem 7.1, 
since there is no constraint on W. This recurrent network is sometimes referred 
to as the generalized Hopfield network. Both theorems hold true irrespective of 
the updating mode, be it synchronous or asynchronous. 

The generalized Hopfield network with a general, zero-diagonal weight matrix 
has stable states in randomly asynchronous mode [54]. The asymptotic storage 
capacity of such a network using the perceptron learning scheme has been ana- 
lyzed in [55]. The perceptron learning rule with zero bias is used to compute 
the columns of W for each neuron independently, and as such the entire W is 
constructed. A lower and an upper bound of the asymptotic storage capacity are
obtained as J - 1 and 2J, respectively.

In a special case of the generalized Hopfield network with zero bias vector, 
some spectral strategies are used for constructing W [77]. All the spectral storage 
algorithms have a storage capacity of J for uncorrelated patterns [77]. A recursive 
implementation of the pseudoinverse spectral storage algorithm has also been 
given. 


Example 7.1: 

This example is designed to check the storage capabilities of the Hopfield 
network trained with three local algorithms, namely, the generalized Hebbian, 
improved Hebbian, and perceptron-type learning rules. After the Hopfield net- 
work is trained with a pattern set, we present the same pattern set and examine 
the average retrieval bit error rates and the average storage error rates for a 
number of random runs. 




Chapter 7. Associative memory networks 


A set of bipolar patterns, each having a bit-length of J = 50, is given. The
pattern set $\{x_p\}$ is generated randomly. Theoretically, the storage capacities for
the generalized Hebbian, improved Hebbian, and perceptron-type learning rules
are $\frac{J}{2 \ln J} = 6.39$, $\frac{J}{\sqrt{2 \ln J}} = 17.88$, and J = 50, respectively. These capacities have
been verified during our experiments. After the Hopfield network is trained, the
maximum number of iterations at the performing stage is set as 30. The bit error
rates and storage error rates are calculated based on 50 random runs.
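The quoted capacities can be reproduced with a few lines of arithmetic. The sketch below assumes the capacity expressions J/(2 ln J) for the generalized Hebbian rule and J/sqrt(2 ln J) for the improved Hebbian rule, which match the values 6.39 and 17.88 for J = 50; the variable names are illustrative.

```python
import math

J = 50                                        # pattern length used in this example

cap_gh = J / (2 * math.log(J))                # generalized Hebbian rule: J / (2 ln J)
cap_ih = J / math.sqrt(2 * math.log(J))       # improved Hebbian rule: J / sqrt(2 ln J)
cap_pt = J                                    # perceptron-type rule: upper bound J

print(f"GH: {cap_gh:.2f}, IH: {cap_ih:.2f}, PT: {cap_pt}")
```

With N = 20 stored patterns, the pattern count is well above the first capacity, slightly above the second, and well below the third.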

Simulation is conducted for the case of N uncorrelated patterns as well as N
slightly correlated patterns. In the case of N uncorrelated patterns, the matrix
composed of the randomly generated patterns is of full rank, that is, it has
a rank of N. In the case of N slightly correlated patterns, N - 1 patterns are
randomly generated and are uncorrelated; the remaining pattern is generated
by linearly combining any three of the N - 1 patterns and then applying the
signum function, until the corresponding matrix has a rank of N - 1.

For perceptron-type learning, we can select W(0) as a symmetric, random,
zero-diagonal matrix with each entry in the range (-0.1, 0.1), or as the zero
matrix. Our empirical results show that the latter scheme generates better
results, and it is used here. $\eta$ is selected as 0.2. The maximum number of epochs
is set as 50. Training terminates when the relative energy change between two
epochs is below $10^{-4}$.

We store a set of N = 20 patterns. The training and performing results are 
shown in Table 7.1, and the evolution of the system energy during the training 
process for a random run is illustrated in Fig. 7.1. 

During the retrieval stage, if the fundamental memories are presented to the 
network, the desired patterns can usually be produced by the network after one 
iteration if N is less than the capacity of the network. The network trained by 
the generalized Hebbian rule cannot correctly retrieve any of the patterns, since 
the number of patterns is much greater than its storage capacity. The network 
trained with the improved Hebbian rule can, on average, correctly retrieve 18 
patterns, which is close to its theoretical capacity. The perceptron-type rule can 
correctly retrieve all the patterns. It is noted that the results for the uncorrelated 
and slightly correlated cases are very close to each other for all these algorithms. 


Example 7.2: Now, let us increase N to 50; the corresponding results are listed 
in Table 7.2 and shown in Fig. 7.2. The average capacity of perceptron-like 
learning is 38 for training 50 epochs. By increasing the number of epochs to 100, 
the storage capability of perceptron-like learning can be further improved, the 
average iteration for the retrieval stage is 1, and the storage error rate is close to 
0. Thus, the storage capability can reach 50 for both the uncorrelated and slightly 
correlated cases. Perceptron-type learning can retrieve all the 50 patterns with 
a small storage error rate, while the improved Hebbian and generalized Hebbian
Table 7.1. Comparison of three associative memory algorithms (J = 50, N = 20).

Uncorrelated
Algorithm   Training epochs   Performing iterations   Bit error rate   Storage error rate
GH          1                 29.64                   0.2226           0.9590
IH          1                 5.66                    0.0038           0.0920
PT          4.70              1                       0                0

Correlated
Algorithm   Training epochs   Performing iterations   Bit error rate   Storage error rate
GH          1                 23.28                   0.2932           0.8660
IH          1                 3.44                    0.0026           0.0680
PT          4.50              1                       0                0

GH: generalized Hebbian rule; IH: improved Hebbian rule; PT: perceptron-type rule.


Figure 7.1 Energy evolutions: J = 50, N = 20. (a) Uncorrelated patterns. (b) Slightly correlated 
patterns. t is the number of iterations. Notice that one epoch corresponds to 20 iterations. 

Table 7.2. Comparison of three associative memory algorithms (J = 50, N = 50). 

Uncorrelated patterns 
Algorithm   Training epochs   Performing iterations   Bit error rate   Storage error rate 
GH          1                 30                      0.2942           1.00 
IH          1                 30                      0.3104           0.9720 
PT          35.44             17.90                   0.0280           0.0688 

Correlated patterns 
Algorithm   Training epochs   Performing iterations   Bit error rate   Storage error rate 
GH          1                 30                      0.3072           0.9848 
IH          1                 30                      0.3061           0.9688 
PT          35.98             14.92                   0.0235           0.0568 

rules actually fail. The perceptron-type rule can almost correctly retrieve all the 
patterns; this accuracy can be further improved by training with more epochs. 


Increasing storage capacity 


When the Hopfield network is used as associative memory, there are cases where 
the fundamental memories are not stable. In addition, spurious states, which are 
other stable states different from the fundamental memories and their negative 
counterparts, may arise [2]. The Hopfield network trained with the generalized 
Hebbian rule can have a large number of spurious states, depending exponen- 
tially on N, the number of fundamental memories, even in the case when these 
vectors are orthogonal [12]. These spurious states are the corners of the unit 
hypercube that lie on or near the subspace spanned by the N fundamental mem- 
ories. The presence of spurious states and limited storage capacity are the two 
major restrictions for the Hopfield network being used as associative memory. It 
has been proved that as long as N < J, J being the number of neurons in the 
network, the fundamental memories are stable in a probabilistic sense [2]. 

The Gardner conditions [30] are often used as a measure of the stability of the 
patterns. Associative learning can be designed to enhance the basin of attrac- 
tion for every pattern to be stored by optimizing these conditions. The Gardner 
algorithm [30] combines maximal storage with a predefined level of stability for 
the patterns. Based on the Gardner conditions, the inverse Hebbian rule [22, 21] 
is given by 

$$w_{ij} = -\frac{1}{N}\left(R^{-1}\right)_{ij}, \qquad (7.13)$$

where the correlation matrix $R = \frac{1}{N}\sum_{p=1}^{N} x_p x_p^T$. 


Unlike the generalized Hebbian rule, which can only store unbiased patterns, 
the inverse Hebbian rule is capable of storing N patterns, biased or unbiased, in 
Figure 7.2 Energy evolutions: J = 50, N = 50. (a) Uncorrelated patterns. (b) Slightly 
correlated patterns. t is the number of iterations. Notice that one epoch corresponds 
to 50 iterations. 


a Hopfield network of N neurons. The patterns have zero basins of attraction, 
and R must be nonsingular. Matrix inversion can be implemented using a local 
learning algorithm. The inverse Hebbian rule provides ideal initial conditions for 
any algorithm capable of increasing the pattern stability. 

Nonmonotonic activation functions are much more beneficial to the enlargement 
of the memory capacity of a neural network model. In [41], by using the 
Gardner algorithm for training the weights and using a nonmonotonic activation 
function 

$$\phi(\mathrm{net}) = \begin{cases} +1, & \mathrm{net} \in (-\infty, -b_1) \cup (0, b_1) \\ -1, & \mathrm{net} \in (b_1, \infty) \cup (-b_1, 0) \end{cases}, \qquad (7.14)$$

the storage capacity of the network can be made to be always larger than 2J 
and reach its maximum value of 10.5J when $b_1 = 1.22$. In [58, 81], a continuous 
nonmonotonic activation function is used to improve the performance of the 
Hopfield network. The exact form of the nonmonotonic activation and its parameters 
are not very critical. The storage capacity of the Hopfield network can be 
improved to approximately 0.4J, and spurious states can be totally eliminated. 
When it fails to recall a memory, a chaotic behavior will occur. In the application 
to realizing the autocorrelation associative memory, the chaotic neural network 
model with sinusoidal activation functions possesses a large memory capacity as 
well as a remarkable ability of retrieving the stored patterns, better than the 
chaotic model with only monotonic activation functions such as sigmoidal func- 
tions [60]. It is shown in [49] that any finite-dimensional network model with 
periodic activation functions and properly selected parameters has much more 
abundant chaotic dynamics that truly determine the model’s memory capacity 
and pattern-retrieval ability. 

The eigenstructure learning rule [48] is developed for continuous-time Hopfield 
models in linear saturated mode. The design method allows linear combinations 
of the prototype vectors to be stored as asymptotically stable equilibrium points 
as well. The storage capacity is better than those of the pseudoinverse solution 
and the generalized Hebbian rule. All the desired patterns are guaranteed to 
be stored as asymptotically stable equilibrium points. The method has been 
extended to discrete-time neural networks in [57]. 

A quantum learning algorithm, which is a combination of quantum compu- 
tation with the Hopfield network, has been developed in [78]. The quantum 
associative memory has a capacity that is exponential in the number of neurons, 
namely, offering a storage capacity of $O(2^J)$. It employs simple spin-1/2 
(two-state) quantum systems and represents patterns as quantum operators. 

Complex-valued associative memories such as complex-valued Hopfield net- 
works are used for storing complex-valued patterns. In [61], a complex percep- 
tron learning algorithm has also been studied for associative memory by using 
complex weights and a decision circle in the complex plane for the output func- 
tion. 


Other associative memories 
Morphological associative memories involve a very low computational effort in 
synthesizing the weight matrix by use of minimax algebra. An exact characteriza- 
tion of the fixed points and the basins of attraction of gray-scale autoassociative 
morphological memories is made in terms of the eigenvectors of the weight matrix 
in [74]. The set of fixed points consists exactly of all linear combinations of the 
fundamental eigenvectors. Morphological associative memories with threshold 
can be viewed as a special case of implicative fuzzy associative memories [73]. In 
particular, if n-dimensional patterns are binary, then $2^n$ patterns can be stored. 
For the generalized BSB model [36], the synthesis for optimal performance 
is performed in [63], given a set of desired binary patterns to be stored as 
asymptotically stable equilibrium points. The synthesis problem is formulated 
as a constrained optimization problem, which can be converted into a quasi- 
convex optimization problem (generalized eigenvalue problem) in the form of an 
LMI (linear matrix inequality)-based optimization problem. In [13], the design 
of recurrent associative memories based on the generalized BSB model is formulated 
as a set of independent classification tasks which can be efficiently solved 
by using a pool of SVMs. 

The storage capacity of recurrent attractor neural networks with sign- 
constrained weights was investigated in [7]. 


Multistate Hopfield networks for associative memory 


In Section 6.6, we have described multistate Hopfield networks. In [27], a multilevel 
Hopfield network modifies the generalized Hebbian rule. The storage capability 
of the multilevel Hopfield network is proved to be $O(J^3)$ bits for a network 
of J neurons, which is of the same order as that of the Hopfield network [1]. 
Given a network of J neurons, the number of patterns that the multilevel net- 
work can reliably store and retrieve may be considerably less than that for the 
Hopfield network, since each codeword in the multilevel Hopfield network typi- 
cally contains more bits. In [70], a storage procedure for the multilevel Hopfield 
network in the synchronous mode is derived based on the LS solution, and also 
examined by using an image restoration example. 

The complex-valued multistate Hopfield network [39, 59] employs the multivalued 
complex-signum activation function that is defined as an L-stage 
phase quantizer for complex numbers. In order to store a set of N patterns, 
$\{x_i\} \subset \{0, 1, \ldots, L-1\}^J$, $x_i$ is first encoded to its complex memory state 
$\epsilon_i = (\epsilon_{i,1}, \ldots, \epsilon_{i,J})^T$ with 

$$\epsilon_{i,j} = z^{x_{i,j}}. \qquad (7.15)$$

The decoding of a memory state to a pattern is the inverse of (7.15). The 
complex-valued pattern set $\{\epsilon_i\}$ can be stored in weights by the generalized 
Hebbian rule [35] 

$$w_{ij} = \frac{1}{J}\sum_{p=1}^{N} \epsilon_{p,i}\,\epsilon_{p,j}^{*}, \quad i, j = 1, 2, \ldots, J, \qquad (7.16)$$

where superscript * denotes conjugate operation. Thus, W is Hermitian. 
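
As a rough illustration of the encoding and Hebbian storage described above, the following Python sketch may help. It assumes that z is the primitive Lth root of unity, z = exp(j2π/L), and that the weights carry a 1/J scaling; both are assumptions made here, and the helper names `encode`, `decode`, and `hebbian_weights` are hypothetical.

```python
import cmath

def encode(x, L):
    """Map integers in {0, ..., L-1} to unit-modulus complex states z**x."""
    # Assumption: z = exp(j*2*pi/L), the primitive L-th root of unity.
    return [cmath.exp(2j * cmath.pi * v / L) for v in x]

def decode(eps, L):
    """Inverse mapping: quantize each phase back to an integer in {0, ..., L-1}."""
    return [round(cmath.phase(e) * L / (2 * cmath.pi)) % L for e in eps]

def hebbian_weights(patterns, L):
    """w_ij = (1/J) * sum_p eps_{p,i} * conj(eps_{p,j}); the 1/J scaling is assumed."""
    J = len(patterns[0])
    E = [encode(x, L) for x in patterns]
    return [[sum(e[i] * e[j].conjugate() for e in E) / J for j in range(J)]
            for i in range(J)]

# Illustrative data: N = 2 patterns, J = 5 neurons, resolution L = 4.
patterns = [[0, 1, 2, 3, 1], [3, 3, 0, 2, 1]]
W = hebbian_weights(patterns, L=4)
roundtrip = decode(encode(patterns[0], 4), 4)
```

Because each weight is a sum of products of one state with the conjugate of another, swapping the indices conjugates the weight, which is why W comes out Hermitian.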

The storage capability of the memory, Nmax, is dependent upon the resolution 
L for an acceptable level of the error probability Pmax. As L increases, Nmax 
decreases, but each pattern contains more information. 

Due to the use of generalized Hebbian rule, the storage capacity of the network 
is very low and the problem of spurious memories is very pronounced. In [46], a 
gradient-descent learning rule has been proposed to enhance the storage capacity 
and also reduce the number of spurious memories. In [59], an LP method has 
been proposed for storing into the network each pattern in an integer set $\mathcal{M} \subset$ 
$\{0, 1, 2, \ldots, L-1\}^J$ as a fixed point. A set of inequalities are employed to render 
each memory pattern as a strict local minimum of a quadratic energy landscape, 
Figure 7.3 Architecture of the J-N-J MLP based recurrent correlation associative memory. 



and the LP method is employed to obtain the weight matrix and the threshold 
vector. The LP method significantly reduces the number of spurious memories. 

Since gray-scale images can be represented by integer vectors, reconstruction 
of such images from their distorted versions constitutes a straightforward appli- 
cation of multistate associative memory. The complex-valued Hopfield network 
is particularly suitable for interpreting images transformed by two-dimensional 
Fourier transform and two-dimensional autocorrelation functions [39]. 


Multilayer perceptrons as associative memories 


Most recurrent network based associative memories have low storage capacity 
as well as poor retrieval ability. Recurrent networks exhibit asymptotic behavior 
and as such are difficult to analyze. MLP-based autoassociative memories with 
equal numbers of input and output nodes have been introduced to overcome 
these limitations [17, 79]. 

The recurrent correlation associative memory uses a J-N-J MLP-based recur- 
rent architecture [17], as shown in Fig. 7.3. Notice that the number of hidden 
units is taken as N, the number of stored patterns. At each time instant, the 
hidden layer computes an intermediate mapping, while the output layer com- 
pletes an association of the input pattern to an approximate prototype pattern. 
The approximated pattern is fed back to the network and the process continues 
until convergence to a prototype is achieved. The activation function for the ith 
neuron in the hidden layer is $\phi_i(\cdot)$, and the activation function at the output 
layer is the signum function. 

The matrix $W^{(1)}$, a $J \times N$ matrix, is made up of the N J-bit bipolar memory 
patterns $x_i$, $i = 1, 2, \ldots, N$, that is, $W^{(1)} = [x_1, x_2, \ldots, x_N]$, and 
$W^{(2)} = [W^{(1)}]^T$. When presenting pattern $x$, the net input to neuron j in the 
hidden layer is $\mathrm{net}_j = x_j^T x$. We have 

$$x(t+1) = \mathrm{sgn}\left(\sum_{j=1}^{N} \phi_j\left(x_j^T x(t)\right) x_j\right). \qquad (7.17)$$

The correlation of two patterns is $x_1^T x_2 = J - 2 d_H(x_1, x_2)$, where $d_H(\cdot)$ is the 
Hamming distance between two binary vectors, that is, the number of 
bits in the two vectors that do not match each other. 

In the case when all $\phi_i(\mathrm{net}) = \phi(\mathrm{net})$, $\phi(\mathrm{net})$ being any continuous, monotonic 
nondecreasing weighting function over $[-J, J]$, (7.17) is proved to be asymptotically 
stable in both the synchronous and asynchronous update modes [17]. This 
property is especially suitable for hardware implementation, since there are faults 
in the manufacture of any physical device. 

When all $\phi_i(\mathrm{net}) = \mathrm{net}$, the recurrent correlation associative memory model 
is equivalent to the correlation-matrix associative memory [42, 8], that is, the 
connection corresponding to the case of the Hopfield network can be written as 
$W = \sum_{p=1}^{N} x_p x_p^T$. By suitably selecting $\phi_i(\cdot)$, the model is reduced to some 
existing associative memories, which have a storage capacity that grows polynomially 
or exponentially with J [17]. 


In particular, when all $\phi_i(\mathrm{net}) = a^{\mathrm{net}}$ with radix a > 1, an exponential 
correlation associative memory [17] is obtained. The exponential activation function 
stretches the ratios among the weights and makes the largest weight more 
overwhelming. This significantly increases the storage capacity. The exponential 
correlation associative memory exhibits an asymptotic storage capacity that scales 
exponentially with J. Under the noise-free condition, this storage capacity is $2^J$ 
patterns [17]. A VLSI chip for this memory has been fabricated and tested [17]. 
The multi-valued recurrent correlation associative memory [18] can increase the 
error-correction capability with large storage capability and less interconnection 
complexity. 
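
The recall dynamics (7.17) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the convention sgn(0) = +1, the radix a = 2, the three 8-bit patterns, and the helper names are all chosen here for the example and are not taken from [17].

```python
def sgn(v):
    # Assumption: sgn(0) = +1; the text does not fix this convention.
    return 1 if v >= 0 else -1

def correlation(u, v):
    """Inner product of two bipolar vectors."""
    return sum(a * b for a, b in zip(u, v))

def hamming(u, v):
    """Number of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def rcam_step(patterns, x, phi):
    """One synchronous update: x(t+1) = sgn(sum_j phi(x_j^T x(t)) x_j)."""
    weights = [phi(correlation(p, x)) for p in patterns]
    return [sgn(sum(w * p[i] for w, p in zip(weights, patterns)))
            for i in range(len(x))]

def rcam_recall(patterns, x, phi, max_iters=20):
    """Iterate until the state stops changing or max_iters is reached."""
    for _ in range(max_iters):
        x_next = rcam_step(patterns, x, phi)
        if x_next == x:
            break
        x = x_next
    return x

patterns = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
probe = [1, 1, 1, 1, -1, -1, -1, 1]        # patterns[1] with its last bit flipped

ecam = lambda net: 2.0 ** net              # exponential weighting with radix a = 2
recalled = rcam_recall(patterns, probe, ecam)
```

With the exponential weighting, the pattern closest to the probe receives a weight of 2^6, against 2^2 and 2^-2 for the other two, so it dominates the sum and the probe is pulled to `patterns[1]`.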

The local identical index model [79] is an autoassociative memory model that 
uses the J-N-J MLP architecture. The weight matrices $W^{(1)}$ and $W^{(2)}$ are the 
same as those defined in the recurrent correlation associative memory model. 
It utilizes the signum activation function and biases in both the hidden and 
output layers. The local identical index model utilizes the local characteristics 
of the fundamental memories through two metrics, namely, the global identical 
index and the local identical index. Using the minimum Hamming distance as 
the underlying association principle, the scheme can be viewed as an approxi- 
mate Hamming decoder. The local identical index model exhibits low structural 
as well as operational complexity. It is a one-shot associative memory, and can 
accommodate up to $2^J$ prototype patterns. This model outperforms the linear 
system in a saturated mode [48] and its discrete version [57] in recognition 
accuracy at the presentation of the corrupted patterns, controlled by using the 
Hamming distance. It can successfully associate input patterns that are even 
loosely correlated with the corresponding prototype pattern. 

For a $J$-$J_2$-$J$ MLP-based autoassociative memory, the hidden layer is a bottleneck 
layer with fewer nodes, $J_2 < J$. This bottleneck layer is used to discover 
a limited set of unique prototypes that cluster the training set. The neurons at 
the bottleneck layer use the sigmoidal function. 
Figure 7.4 Architecture of the J-N-N Hamming network. 



In a three-layer MLP with a feedback connection that links input and output 
layers through delay elements [4], the MLP is initially trained to store the real- 
valued patterns as an autoassociative memory. A sigmoidal function is chosen 
for the hidden-layer nodes, while at the output layer there is a linear summation. 
The model has compact size to store numerous stable fixed points, but it is also 
able to learn asymmetric arrangement of fixed points, whereas the self-feedback 
neural network model is limited to orthogonal arrangements. 


Hamming network 


The Hamming network [50] is a straightforward associative memory. It calculates 
the Hamming distance between the input pattern and each memory pattern, and 
selects the memory with the smallest Hamming distance. The network output 
is the index of a prototype pattern and thus the network can be used as a pat- 
tern classifier. The Hamming network is used as the classical Hamming decoder 
or Hamming associative memory. It provides the minimum-Hamming-distance 
solution. 

The Hamming network has a J-N-N layered architecture, as illustrated in 
Fig. 7.4. The activation function at each of the units in each layer, denoted 
by vector $\phi$, is the signum function, and $\theta^{(2)}$ is a vector comprising the biases 
for all neurons at the hidden layer. The weights in the third layer are $T = [t_{kl}]$, 
$k, l = 1, \ldots, N$. 

The third layer is called the memory layer, each of whose neurons corresponds 
to a prototype pattern. The input and hidden layers are feedforward, fully con- 
nected, while each hidden node has a feedforward connection to its corresponding 
node in the memory layer. Neurons in the memory layer are fully interconnected, 
and form a competitive subnetwork known as the MAXNET. The MAXNET 
responds to an input pattern by generating a winner neuron through iterative 
competitions. The Hamming network is implicitly recurrent due to the interconnections 
in the memory layer. 

The second layer generates matching scores that are equal to J minus the 
Hamming distances to the stored patterns, that is, $J - d_H(x, x_i)$, $i = 1, \ldots, N$, 
for pattern $x$. These matching scores range from 0 to J. The unit with the highest 
matching score corresponds to the stored pattern that best matches the input. 
The weights between the input and hidden layers and the biases of the hidden 
layer are, respectively, set as 

$$w_{ij} = \frac{x_{j,i}}{2}, \quad \theta_j^{(2)} = -\frac{J}{2}, \quad j = 1, \ldots, N, \; i = 1, \ldots, J, \qquad (7.18)$$

where $x_{j,i}$ is the ith component of the stored pattern $x_j$. For bipolar patterns, 
the net input $\sum_{i} w_{ij} x_i - \theta_j^{(2)} = (x_j^T x + J)/2$ then equals $J - d_H(x, x_j)$. 


All the thresholds and the weights $t_{kl}$ in the MAXNET are fixed. The thresholds 
are set as zero. The weights from each node to itself are set as unity and 
weights between nodes are inhibitory, that is, 

$$t_{kl} = \begin{cases} 1, & k = l \\ -\varepsilon, & k \neq l \end{cases}, \qquad (7.19)$$

where $\varepsilon < \frac{1}{N}$. 


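When a binary pattern is presented to the network, the network first generates 
an initial input for the MAXNET 

$$y_j(0) = \phi\left(\sum_{i=1}^{J} w_{ij} x_i - \theta_j^{(2)}\right), \quad j = 1, \ldots, N, \qquad (7.20)$$

where $\phi(\cdot)$ is the threshold-logic nonlinear function. 
The input pattern is then removed and the MAXNET continues the iteration 

$$y_j(t+1) = \phi\left(\sum_{k=1}^{N} t_{kj} y_k(t)\right) = \phi\left(y_j(t) - \varepsilon \sum_{k=1, k \neq j}^{N} y_k(t)\right), \quad j = 1, \ldots, N, \qquad (7.21)$$
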
until the output of only one node is positive. This node corresponds to the 
selected class. 

The Hamming network implements the minimum error classifier, when the bit 
errors are random and independent. For a J-N-N Hamming network, there are 
$J \times N + N^2$ connections, while for the Hopfield network the number of connections 
is $J^2$. When $J \gg N$, the number of connections in the Hamming network 
is significantly less than that in the Hopfield network. In addition, the Hamming 
network offers a storage capacity that is exponential in the input dimension [33], 
and it does not have any spurious state that corresponds to a no-match result. 
Under the noise-free condition, the Hamming network has a storage capacity 
of $2^J$ patterns [33]. For a sufficiently large but finite radix a, the exponential 
correlation associative memory operates as a Hamming associative memory [33]. 
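
The two stages of the Hamming network, matching scores followed by MAXNET competition, can be sketched as follows. This is a minimal illustration under stated assumptions: bipolar patterns, the matching score computed as $(x_j^T x + J)/2$, an inhibition weight of 0.5/N (any positive value below 1/N would do), and hypothetical function names.

```python
def threshold_logic(v):
    """Threshold-logic (ramp) function: passes positive values, clips the rest to 0."""
    return v if v > 0 else 0.0

def hamming_network(patterns, x, eps=None, max_iters=1000):
    """Return the index of the stored bipolar pattern closest to x, or None on a tie."""
    N, J = len(patterns), len(x)
    if eps is None:
        eps = 0.5 / N                      # assumed choice satisfying 0 < eps < 1/N
    # Matching scores J - d_H(x, x_j) = (x_j^T x + J) / 2 initialise the MAXNET.
    y = [threshold_logic(sum(p_i * x_i for p_i, x_i in zip(p, x)) / 2 + J / 2)
         for p in patterns]
    # MAXNET: each node keeps its own output and is inhibited by all the others.
    for _ in range(max_iters):
        if sum(1 for v in y if v > 0) <= 1:
            break
        total = sum(y)
        y = [threshold_logic(v - eps * (total - v)) for v in y]
    winners = [j for j, v in enumerate(y) if v > 0]
    return winners[0] if len(winners) == 1 else None

patterns = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
probe = [1, 1, 1, 1, -1, -1, -1, 1]        # patterns[1] with its last bit flipped
winner = hamming_network(patterns, probe)
```

For this probe the initial matching scores are 5, 7, and 3, and the competition leaves only the node of the second pattern positive. When two stored patterns are equally close to the input, their nodes stay tied and the sketch returns `None` rather than picking one arbitrarily.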

The Hamming network suffers from difficulties in hardware implementation 
and low retrieval speed. Based on the correspondence between the Hamming 
network and the exponential correlation associative memory [33], the exponential 
correlation associative memory can be used to compute the minimum Hamming 
distance, in a distributed fashion by analog exponentiation and thresholding 
devices. The two-level Hamming network [37] generalizes the Hamming memory 
by providing for local Hamming distance computations in the first level and 
a voting mechanism in the second level. It allows for a much more practical 
hardware implementation and a faster retrieval. 


Bidirectional associative memories 


BAM is created by adapting the nonlinear feedback of the Hopfield model to 
a heteroassociative memory [45]. It is used to store N bipolar pairs $(x_p, y_p)$, 
$p = 1, \ldots, N$, where $x_p = (x_{1p}, \ldots, x_{J_1 p})^T$, $y_p = (y_{1p}, \ldots, y_{J_2 p})^T$, and 
$x_{ip}, y_{jp} \in \{+1, -1\}$. The first layer has $J_1$ neurons and the second has $J_2$ neurons. 

BAM learning is accomplished with a simple Hebbian rule. The weight matrix 
is 


$$W = Y X^T = \sum_{p=1}^{N} y_p x_p^T, \qquad (7.22)$$


where X and Y are matrices that represent the sets of bipolar vector pairs to 
be associated. BAM effectively uses a signum function to recall noisy patterns, 
despite the fact that the weight matrix developed is not optimal. This one-shot 
learning rule leads to poor memory storage capacity, is sensitive to noise, and is 
subject to spurious steady states during recall. 

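The retrieval process is an iterative feedback process that starts with $X^{(0)}$ in 
layer 1: 

$$Y^{(t+1)} = \mathrm{sgn}\left(W X^{(t)}\right), \quad X^{(t+1)} = \mathrm{sgn}\left(W^T Y^{(t+1)}\right). \qquad (7.23)$$

For any real connection matrix, one of the fixed points $(x_p, y_p)$ can be obtained from 
this iterative process. A fixed point has the properties: 

$$x_p = \mathrm{sgn}\left(W^T y_p\right), \quad y_p = \mathrm{sgn}\left(W x_p\right). \qquad (7.24)$$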


With the iterative process, BAM can achieve both heteroassociative and autoas- 
sociative data recollections. The final state in layer 1 represents the autoassocia- 
tive recall, and the final state in layer 2 represents the heteroassociative recall. 
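
A small Python sketch of Hebbian BAM storage and bidirectional recall is given below as an illustration. It assumes the weight matrix is the sum of outer products $y_p x_p^T$, so that layer 2 is driven by W and layer 1 by its transpose, and it keeps a neuron's previous state when its net input is exactly zero; that tie rule, the toy pattern pairs, and the function names are assumptions made for the example.

```python
def sgn_vec(net, prev):
    """Signum per component; a zero net input keeps the previous state (assumed rule)."""
    return [1 if a > 0 else -1 if a < 0 else p for a, p in zip(net, prev)]

def bam_weights(X, Y):
    """W = sum_p y_p x_p^T, stored as a J2 x J1 nested list."""
    J1, J2 = len(X[0]), len(Y[0])
    return [[sum(y[j] * x[i] for x, y in zip(X, Y)) for i in range(J1)]
            for j in range(J2)]

def bam_recall(W, x, max_iters=50):
    """Alternate y = sgn(W x) and x = sgn(W^T y) until neither layer changes."""
    J2, J1 = len(W), len(W[0])
    y = [1] * J2                           # arbitrary initial state of layer 2
    for _ in range(max_iters):
        y_new = sgn_vec([sum(W[j][i] * x[i] for i in range(J1)) for j in range(J2)], y)
        x_new = sgn_vec([sum(W[j][i] * y_new[j] for j in range(J2)) for i in range(J1)], x)
        if x_new == x and y_new == y:
            break
        x, y = x_new, y_new
    return x, y

# Two bipolar pairs with J1 = 8 and J2 = 3 (illustrative data).
X = [[1, 1, 1, 1, -1, -1, -1, -1], [1, -1, 1, -1, 1, -1, 1, -1]]
Y = [[1, 1, -1], [1, -1, 1]]
W = bam_weights(X, Y)

probe = [-1, 1, 1, 1, -1, -1, -1, -1]      # X[0] with its first bit flipped
x_final, y_final = bam_recall(W, probe)
```

The state returned for layer 1 is the cleaned-up input (autoassociative recall), and the state returned for layer 2 is the associated pattern (heteroassociative recall).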

Kosko’s encoding method has the ability of incremental learning: encoding 
new pattern pairs to the model is based on only the current connection matrix. 
This method can correctly store up to se pattern pairs [47]. When the 
total number exceeds this, all pattern pairs may not be stored as fixed points. To 
avoid this, one can forget past pattern pairs such that BAM can correctly store 
as many as possible of the most recent learning pattern pairs. Forgetting learning 
is an incremental learning rule in associative memories. The storage behavior of 
BAM under the forgetting learning is analyzed in [47]. 

In [15], a BAM model is introduced that uses a simple self-convergent iterative 
learning rule and a nonlinear output function. In addition, the output function 
enables it to learn and recall grey-level patterns in a bidirectional way. The 
heteroassociative neural network can learn bipolar and real-valued patterns. The 
network is able to stabilize its weights to fixed-point attractors in a local fashion. 
The model is immune to overlearning and develops fewer spurious attractors 
compared to other BAMs. 

The storage capacity of Kosko's BAM is $0.15J_1$, while that of asymmetrical 
BAM [80], generalized asymmetrical BAM [26], and the model proposed in [15] is 
$J_1$, and that for general BAM [69] is greater than $J_1$. Global exponential stability 
criteria are established for BAM networks with time delays in [53]. 

Most BAM networks use a symmetrical output function for dual fixed-point 
behavior. In [16], by introducing an asymmetry parameter into a chaotic BAM 
output function, prior knowledge can be used to momentarily disable desired 
attractors from memory, hence biasing the search space to improve recall per- 
formance. This property allows control of chaotic wandering, favoring given sub- 
spaces over others. In addition, reinforcement learning can then enable a dual 
BAM architecture to store and recall nonlinearly separable patterns. The same 
BAM framework is allowed to model three different types of learning: supervised, 
reinforcement, and unsupervised. 


Cohen-Grossberg model 


A general class of neural networks defined in [19] is globally stable. The general 
form of the model is 

$$\frac{dx_i}{dt} = a_i(x_i)\left[b_i(x_i) - \sum_{j=1}^{J} c_{ij} d_j(x_j)\right], \qquad (7.25)$$

where J is the number of units in the network, $x = (x_1, x_2, \ldots, x_J)^T$ is the state of 
the network, $C = [c_{ij}]$ is a coefficient matrix, $a_i(\cdot)$ is the amplification function, 
$b_i(\cdot)$ is the self-signal function, and $d_j(\cdot)$ is the other signal function [32]. 


Theorem 7.3 (Cohen-Grossberg theorem). If a network can be written 
in the general form (7.25) and also obeys the three conditions, (C1) symmetry: 
$c_{ij} = c_{ji}$, (C2) positivity: $a_j(x_j) > 0$, (C3) monotonicity: the derivative $d_j'(x_j) > 0$, 
then its Lyapunov function can be defined by 

$$E(x) = -\sum_{i=1}^{J} \int_0^{x_i} b_i(\xi_i)\, d_i'(\xi_i)\, d\xi_i + \frac{1}{2}\sum_{j=1}^{J}\sum_{k=1}^{J} c_{jk}\, d_j(x_j)\, d_k(x_k). \qquad (7.26)$$
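
To see why the three conditions give stability, one can differentiate the Lyapunov function along a trajectory. The derivation below is a sketch. It takes the dynamics in the form $\dot{x}_i = a_i(x_i)\,[\,b_i(x_i) - \sum_{k} c_{ik} d_k(x_k)\,]$ and writes the Lyapunov function as $E$; the symbol $E$ and the lower integration limit 0 are notational assumptions made here.

```latex
\begin{align*}
\frac{dE}{dt}
  &= \sum_{i=1}^{J} \frac{\partial E}{\partial x_i}\,\dot{x}_i
   = \sum_{i=1}^{J} \Bigl[-b_i(x_i)\,d_i'(x_i)
      + \sum_{k=1}^{J} c_{ik}\,d_k(x_k)\,d_i'(x_i)\Bigr]\dot{x}_i
      && \text{(symmetry, } c_{ik} = c_{ki}\text{)} \\
  &= -\sum_{i=1}^{J} d_i'(x_i)\Bigl[b_i(x_i) - \sum_{k=1}^{J} c_{ik}\,d_k(x_k)\Bigr]\dot{x}_i \\
  &= -\sum_{i=1}^{J} a_i(x_i)\,d_i'(x_i)
      \Bigl[b_i(x_i) - \sum_{k=1}^{J} c_{ik}\,d_k(x_k)\Bigr]^{2} \;\le\; 0 .
\end{align*}
```

Symmetry (C1) is what allows the double sum to be differentiated into a single term, while positivity (C2) and monotonicity (C3) make every summand in the last line nonnegative, so $E$ cannot increase along a trajectory.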


Many neural networks fall in this class of models [32]: e.g., BSB [9] and the 
Boltzmann machine. The global stability of the continuous-time, continuous-state 
BSB dynamical systems with real symmetric weight matrices is proved in 
[19]. Nonlinear dynamic recurrent associative memory [14, 15] is an instance 
of the Cohen-Grossberg model when $\delta < 1/2$ [34], and analysis of the energy 
function shows that the transmission is stable in the entire domain of the model 
[34]; it is a nonlinear synchronous attractor neural network, converging to a set 
of real-valued attractors in single-layer neural networks [14] and bidirectional 
associative memories [15]. 

Applications of networks with time delays often require that the network has 
a unique equilibrium point which is globally exponentially stable, if the network 
is to be suitable for solving problems in real time. In hardware implementation 
of neural networks, time delays, even time-varying delays, in neuron signal 
transmission or processing are often inevitable. It is more realistic to design neural 
networks which are robust to delays. 


Cellular networks 


Cellular networks are suitable for hardware implementation and, consequently, 
for their employment in applications such as in real-time image processing and 
in construction of efficient associative memories. Adjustment of cellular network 
parameters is a complex problem involved in the configuration of cellular networks 
for associative memories. 

A cellular network is a high-dimensional array of cells that are locally inter- 
connected. The neighborhood of a cell c(i, j) is denoted by 


$$V_r(i, j) = \{c(k, l) : \max(|k - i|, |l - j|) \le r\}, \quad k = 1, \ldots, M, \; l = 1, \ldots, N, \qquad (7.27)$$

where the subscript r indicates the neighborhood radius around the cell and 
M x N is the total number of cells in the array. 
The dynamics of a cellular network are given by 

$$\dot{x} = -x + T y + I, \qquad (7.28)$$

$$y = \mathrm{sat}(x), \qquad (7.29)$$

where $x = (x_1, x_2, \ldots, x_n)^T \in R^n$ denotes the cell states, $n = M \times N$ represents 
the number of network cells, $y = (y_1, y_2, \ldots, y_n)^T \in R^n$, $y_i \in [-1, 1]$, $i = 1, \ldots, n$, 
denotes the cell outputs (bipolar), $T \in R^{n \times n}$ represents the interconnection matrix 
(sparse), $I \in R^n$ is a vector of biases, and $\mathrm{sat}(x) = (\mathrm{sat}(x_1), \ldots, \mathrm{sat}(x_n))^T$ is a 
saturation function, 

$$\mathrm{sat}(x_i) = \begin{cases} -1, & x_i < -1 \\ x_i, & -1 \le x_i \le 1 \\ 1, & x_i > 1 \end{cases}. \qquad (7.30)$$


Cellular networks for image processing 
For image processing, cellular networks are generally used for movement detection, 
contour extraction, image smoothing, and detection of directional stimuli. 
An operation is in general applied to each image pixel (cell). Noise and contours 
can produce crisp variations. Image smoothing must eliminate noise and preserve 
contours, and this can be achieved using the Laplacian operator. 

For image smoothing, the image is provided as initial condition $x(0)$ of the 
network and I must be zero. Output $y(t)$ represents the processed image, when 
$t \to \infty$. 


Example 7.3: In order to illustrate the elimination of noise by a cellular network, 
an original image (Fig. 7.5a) is corrupted by Gaussian noise of variance 0.01 
generating the image shown in Fig. 7.5b. 

The range of $x_{ij}(0)$ is [-1.0, 1.0] to codify pixel intensities in a grayscale, with 
-1.0 corresponding to white and 1.0 to black. The neighborhood can be stored by 
a 3 x 3 matrix, which corresponds to a neighborhood of r = 1. Equations (7.28) 
and (7.29) can be numerically integrated using the fourth-order Runge-Kutta 
algorithm for the time period [0, 0.1] and for 10 iterations, and the bias is set to 
0. 

In a cellular network, for each pixel of an image Y, T can be determined from 
a mask A with a neighborhood of radius 1: 

$$T = \begin{pmatrix} A_{-1,-1} & A_{-1,0} & A_{-1,1} \\ A_{0,-1} & A_{0,0} & A_{0,1} \\ A_{1,-1} & A_{1,0} & A_{1,1} \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$


The central position (Ao,o) denotes the connection from a cell (i,j) to itself. 
The representation of T using a mask highlights the relationship of a cell and 
its neighbors. Moreover, the mask synthesizes how a cellular network processes 
signals. 

Figure 7.5c presents the image obtained by applying the Laplacian operator 
implemented in the cellular network. Figure 7.5d presents the image obtained by 
applying the Laplacian operator directly to the noisy image. It is seen that image 
smoothing using the cellular network is better than the conventional way of 
applying the image operator. 
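
The smoothing procedure of this example can be sketched in plain Python as follows. This is a minimal illustration, not the code used to produce Fig. 7.5: it integrates the cell dynamics with the Laplacian mask, zero bias, and ten fourth-order Runge-Kutta steps over [0, 0.1], and it assumes that border pixels are replicated outside the array, a detail the example does not specify. The 5 x 5 test image and all function names are hypothetical.

```python
MASK = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]      # Laplacian mask, neighborhood radius 1

def sat(v):
    """Piecewise-linear saturation onto [-1, 1]."""
    return max(-1.0, min(1.0, v))

def rhs(x, mask=MASK, bias=0.0):
    """Right-hand side dx/dt = -x + T y + I, with T applied through the 3 x 3 mask."""
    M, N = len(x), len(x[0])
    y = [[sat(v) for v in row] for row in x]
    dx = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k = min(max(i + di, 0), M - 1)   # replicate the border (assumption)
                    l = min(max(j + dj, 0), N - 1)
                    acc += mask[di + 1][dj + 1] * y[k][l]
            dx[i][j] = -x[i][j] + acc + bias
    return dx

def axpy(x, k, a):
    """Return x + a * k for two equally sized 2-D lists."""
    return [[xv + a * kv for xv, kv in zip(xr, kr)] for xr, kr in zip(x, k)]

def rk4_step(x, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = rhs(x)
    k2 = rhs(axpy(x, k1, h / 2))
    k3 = rhs(axpy(x, k2, h / 2))
    k4 = rhs(axpy(x, k3, h))
    return [[xv + h / 6 * (a + 2 * b + 2 * c + d)
             for xv, a, b, c, d in zip(xr, r1, r2, r3, r4)]
            for xr, r1, r2, r3, r4 in zip(x, k1, k2, k3, k4)]

def smooth(image, t_end=0.1, steps=10):
    """Use the image as x(0), integrate to t_end, and return the output y = sat(x)."""
    x = [row[:] for row in image]
    for _ in range(steps):
        x = rk4_step(x, t_end / steps)
    return [[sat(v) for v in row] for row in x]

# Toy image: flat gray level 0.5 with a single noisy pixel in the middle.
image = [[0.5] * 5 for _ in range(5)]
image[2][2] = 0.9
out = smooth(image)
```

In this toy case the noisy pixel starts 0.4 above its surroundings and ends roughly 0.25 above them, because the mask spreads the excess intensity to the four neighbors at each step.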


Cellular networks for associative memories 

To use cellular networks as associative memories, parameters T and I need 
to be properly adjusted. The equilibrium points of (7.28) correspond to the 
different patterns of the input to be stored. In the pseudoinverse method [31], 
pseudoinverse matrices are utilized for solving the equilibrium equations (7.28): 
For all patterns p = 1,...,N, 


$$x^p = T y^p + I, \qquad (7.31)$$

$$y^p = \mathrm{sat}(x^p), \qquad (7.32)$$



Figure 7.5 Image smoothing using cellular network. (a) Original image. (b) Image corrupted by 
Gaussian noise; (c) Image restored using Laplacian operator implemented in a cellular network. (d) 
Image obtained by applying Laplacian operator directly on the noisy image. 


where $y^p \in \{-1, 1\}^n$ represents the pth pattern to be stored and $x^p$ are the 
respective equilibrium points. 

Other methods are SVD-based [51], Hebbian learning based [75], perceptron-based 
[52], and LMI-based [64] methods. The five algorithms and their comparison 
are given in [25]. LMI and Hebbian methods show superior performance 
for tests involving binary noise. For tests with Gaussian noise, perceptron, SVD, 
and Hebbian approaches present similar performance with almost 100% of pattern 
retrieval. The LMI method is not adequate for cellular networks with small 
neighbor radius. The pseudoinverse method presents poor performance for cellular 
networks with larger neighbor radius. In general, these approaches present 
better performance for patterns corrupted with Gaussian noise than with binary noise. 

Associative memories can be synthesized based on discrete-time recurrent net- 
works [82] or continuous-time cellular networks with time delays [83]. The design 
procedure enables both hetero- and autoassociative memories to be synthesized. 
In [82], the designed memories have high storage capacity and assure global 
asymptotic stability. As typical representatives, discrete-time cellular networks 
designed with space-invariant cloning templates are examined in detail. The 
storage capacity of the designed associative memories is as high as 2J bipolar 
patterns [82]. In [83], the synthesizing procedure solves a set of linear inequali- 
ties with few design parameters and retrieval probes feeding from external inputs 
instead of initial states. The designed associative memories are robust in terms of 
design parameter selection. In addition, the hosting cellular networks are guar- 
anteed to be globally exponentially stable. 


7.1 For the network trained in Example 7.1, randomly flip up to 10% of the 
J bits, and then retrieve using the learned network. Calculate the bit error rate 
and storage error rate for the case of the three algorithms. 


7.2 Find the weights and thresholds for a Hopfield network that stores the 
patterns 0101, 1010, 0011, 0110. 


7.3 Store three fundamental memories into a five-neuron Hopfield network: 


&i = ( 1, 1 H T, H L 1), £2 = ( 1, 1, 1, H 1, 1), &3 = (+1, —1,—1, +1, +1). 








The Hamming distances between these memories are 2 or 3. 

a) Solve for the 5 x 5 weight matrix. 

b) Verify that the three fundamental memories can be correctly retrieved using 
asynchronous updating. What about synchronous updating? 

c) When presenting a noisy version of s, with one element’s polarity being 
reversed, verify the retrieval performance. 

d) Show that $-\xi_1$, $-\xi_2$, $-\xi_3$ are also fundamental memories. 

e) If the second element of £s is unknown at the retrieval stage, find the
retrieval result. 

f) Calculate the energy of the network. 


7.4 Consider the Hopfield network. 

a) Plot storage capacity versus the number of neurons N for N up to 20. Consider
the cases for different storage algorithms and for different bounds. 

b) If we want to store 12 patterns, how many neurons are needed? 





7.5 Write a program to implement the Hamming network. Use the program to 
retrieve ten 5 x 7 numeric digits. 


7.6 The BSB model can be characterized by 

$$ y(n) = x(n) + \beta W x(n), $$

$$ x(n+1) = \varphi(y(n)), $$

where $\beta$ is a small positive constant, $W$ is a symmetric weight matrix
whose largest eigenvalues have positive real components, and the activation
function $\varphi(x)$ is a piecewise-linear function which operates on the
components of a vector: it is +1 if $x > +1$, -1 if $x < -1$, and is linear in
between. The Lyapunov function for the BSB model is
$E = -\frac{\beta}{2} x^T W x$. Prove the stability of the BSB model by using
the Cohen-Grossberg theorem. 


7.7 Store five 100 x 100 images into a Hopfield network. For example, you can 
store the pictures of five different shapes, or different objects. 

(a) Add noise to one of the pictures, and retrieve the correct object. 

(b) Erase some parts of the picture, and then retrieve the correct object. 


7.8 Consider the BAM model. 
(a) Train it using samples (0010010010, 01), (0100111101, 10). 
(b) Test the trained network, with a corrupted sample. 


7.9 This problem is adapted from [15]. Convert the 26 English characters from 
lower case to upper case. The network associates 26 correlated patterns consisting 
of 7 x 7 binary images. Figure 7.6 illustrates the stimuli used for the simulation. 
(a) Check whether Kosko’s BAM can handle the storage. Notice that the images 
were converted into vectors of 49 dimensions and this corresponds to a memory 
load of 53% (26/49) of the space capacity. 

(b) Select another BAM algorithm with higher storage capacity. 

(c) Test the network performance on a noisy recall task. The task was to recall 
the correct associated stimulus from a noisy input obtained by randomly flipping 
from 0 to 10 pixels in the input pattern, corresponding to a noise proportion of 
0 to 20%. 


7.10 The Laplacian of a function $\psi(x, y)$ in the plane is given by
$\nabla^2 \psi = \frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2}$.
For an image $\psi$ to be processed by a cellular network, verify that the mask
for each cell (pixel) is

$$ A = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}. $$

[Hint: Apply an approximation of central differences by Taylor series
expansion.] 
Clustering I: Basic clustering models and algorithms 


8.1 Introduction 


Clustering is a fundamental tool for data analysis. It finds wide applications 
in many engineering and scientific fields including pattern recognition, feature 
extraction, vector quantization, image segmentation, bioinformatics and data 
mining. Clustering is a classical method for the prototype selection of
kernel-based neural networks such as the RBF network, and is most useful for
neurofuzzy systems. 

Clustering is an unsupervised classification technique that identifies some 
inherent structure present in a set of objects based on a similarity measure. 
Clustering methods can be derived from statistical models or competitive
learning, and correspondingly they can be classified into generative (or
model-based) and discriminative (or similarity-based) approaches. A clustering
problem can also be modelled as a COP. Clustering neural networks are
statistical models, where a probability density function (pdf) for data is
estimated by learning its parameters. 

Chapters 8 and 9 are dedicated to clustering. Our emphasis is placed on a 
number of competitive learning based neural networks and clustering algorithms. 
In this chapter, we describe the SOM, learning vector quantization (LVQ) and 
ART models, as well as C-means, subtractive and fuzzy clustering algorithms. 
Chapter 9 deals with many associated topics such as the underutilization
problem, robust clustering, hierarchical clustering and cluster validity.
Kernel-based clustering is introduced in Chapter 17. 


8.1.1 Vector quantization 


Figure 8.1 Voronoi tessellation in two-dimensional space. Codebook vectors are
denoted by black points. 

Vector quantization is a classical method that produces an approximation to a 
continuous pdf $p(x)$ of the vector variable $x \in \mathbb{R}^n$ using a finite
number of prototypes. That is, vector quantization represents a set of feature
vectors $x$ by a finite set of prototypes
$\{c_1, \ldots, c_K\} \subset \mathbb{R}^n$. The finite set of prototypes is
referred to as the codebook. Codebook design can be performed by using
clustering algorithms. Once the codebook is specified, the approximation of $x$
involves finding the reference vector $c$ closest to $x$ such that 

$$ \|x - c\| = \min_i \|x - c_i\|. \tag{8.1} $$


This is the nearest-neighbor paradigm, and the procedure is actually the simple 
competitive learning. 

The codebook can be designed by minimizing the expected squared quantization
error 

$$ E = \int \|x - c\|^2 \, p(x) \, dx, \tag{8.2} $$

where $c$ satisfies (8.1), that is, $c$ is a function of $x$ and $c_i$. 

An iterative approximation scheme for finding the codebook is derived from 
criterion (8.2) [73] 

$$ c_i(t+1) = c_i(t) + \eta(t) \, \delta_{wi} \, [x(t) - c_i(t)], \tag{8.3} $$

where subscript $w$ denotes the index of the prototype closest to $x(t)$, termed
the winning prototype, $\delta_{wi}$ is the Kronecker delta ($\delta_{wi} = 1$
for $w = i$, and 0 otherwise), and $\eta > 0$ is a small learning rate,
satisfying the classical Robbins-Monro conditions 

$$ \sum_t \eta(t) = \infty \quad \text{and} \quad \sum_t \eta^2(t) < \infty. \tag{8.4} $$

Typically, $\eta$ is selected to be decreasing monotonically in time. For
instance, one can select $\eta(t) = \eta_0 (1 - t/T)$, where
$\eta_0 \in (0, 1]$ and $T$ is the maximum number of iterations. 
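
The update (8.3) with the linearly decreasing learning rate can be sketched in a
few lines of Python with NumPy. This is a minimal illustration under stated
assumptions, not the implementation used in the text: the function names
`online_vq` and `quantization_error`, the choice of drawing one random pattern
per step, and the initialization of the codebook from training vectors are all
hypothetical choices made for the example.

```python
import numpy as np

def online_vq(X, K, T, eta0=0.5, seed=0):
    """Online codebook design by rule (8.3) with eta(t) = eta0 * (1 - t/T)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from K distinct training vectors.
    C = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for t in range(T):
        x = X[rng.integers(len(X))]                   # sample x(t)
        w = np.argmin(np.linalg.norm(C - x, axis=1))  # winner, as in (8.1)
        eta = eta0 * (1.0 - t / T)                    # linearly decreasing rate
        C[w] += eta * (x - C[w])                      # only the winner moves
    return C

def quantization_error(X, C):
    """Empirical counterpart of (8.2): mean squared distance to the nearest prototype."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```

Calling `online_vq(X, K, T)` on an `(N, J)` array returns a `(K, J)` codebook;
`quantization_error` can then be used to compare codebooks obtained with
different `K` or different rate schedules.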

Voronoi tessellation, also called a Voronoi diagram, is useful for the illustration 
of vector quantization results. The space is partitioned into a finite number of 
regions bordered by hyperplanes. Each region is represented by a codebook vec- 
tor, which is the nearest neighbor to any point within the same region. An illus- 
tration of Voronoi tessellation in the two-dimensional space is shown in Fig. 8.1. 
All vectors in one of the regions constitute a Voronoi set. For a smooth
underlying probability density $p(x)$ and large $K$, all regions in an optimal
Voronoi partition have the same within-region variance $\sigma^2$ [47]. 
Figure 8.2 Architecture of the J-K competitive learning network. 


8.1.2 Competitive learning 


Competitive learning can be implemented using a J-K neural network. The 
output layer is called the competition layer whose neurons are fully connected to 
the input nodes. In the competition layer, lateral connections are used to perform 
lateral inhibition. The architecture of the competitive learning network is shown 
in Fig. 8.2. For input $x$, the network selects one of the $K$ prototypes
(weights) $c_i$ by setting $y_i = 1$ and $y_j = 0$, $j \neq i$. 

The basic principle underlying competitive learning is the mathematical statis- 
tics problem called cluster analysis. Competitive learning is usually based on the 
minimization of a functional such as 


$$ E = \frac{1}{N} \sum_{p=1}^{N} \sum_{k=1}^{K} \mu_{kp} \|x_p - c_k\|^2, \tag{8.5} $$

where $N$ is the size of the pattern set, and $\mu_{kp}$ is the connection
weight assigned to prototype $c_k$ with respect to $x_p$, denoting the
membership of pattern $p$ in cluster $k$. 

Minimization of (8.5) can lead to batch algorithms, but it is difficult to apply 
the gradient-descent method, since the winning prototypes must be determined 
with respect to each pattern $x_p$. By using the functional 

$$ E_p = \sum_{k=1}^{K} \mu_{kp} \|x_p - c_k\|^2, \tag{8.6} $$

the gradient-descent method leads to sequential updating of the prototypes with 
respect to pattern $x_p$. When $c_k$ is the winning prototype of $x_p$ in terms
of the Euclidean metric, $\mu_{kp} = 1$; otherwise, $\mu_{kp} < 1$. 

Simple competitive learning is derived by minimizing (8.5) under the assumption
that the weights are obtained according to the nearest-prototype condition 

$$ \mu_{kp} = \begin{cases} 1, & k = \arg\min_j \|x_p - c_j\|, \\ 0, & \text{otherwise}. \end{cases} \tag{8.7} $$

Thus (8.5) becomes 

$$ E = \frac{1}{N} \sum_{p=1}^{N} \min_{1 \le k \le K} \|x_p - c_k\|^2. \tag{8.8} $$

This is the average of the squared Euclidean distances between the inputs $x_p$
and their closest prototypes $c_k$. The minimization of (8.8) implies that each 
input attracts only its winning prototype and has no effect on its nonwinning 
prototypes. 

Based on the squared error criterion (8.6) and the gradient-descent method, 
assuming $c_w(t)$ to be the winning prototype of $x_t$, we get simple
competitive learning as 

$$ c_w(t+1) = c_w(t) + \eta(t) [x_t - c_w(t)], \tag{8.9} $$

$$ c_i(t+1) = c_i(t), \quad i \neq w, \tag{8.10} $$

where $\eta(t)$ can be selected according to (8.4). The process is known as
winner-takes-all (WTA). In the WTA process, agents in a group compete with each 
other and only the one with the highest input stays active while all the others 
are deactivated. This phenomenon widely exists in nature and society. The WTA 
mechanism plays an important role in the design of unsupervised learning neural 
networks. If each cluster has its own learning rate $\eta_i = 1/N_i$, where
$N_i$ is the number of samples assigned to the $i$th cluster, the algorithm
achieves the minimum output variance [117]. k-winners-take-all (k-WTA) is a
process of selecting k winners from the codebook for a training sample $x_t$. 
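
The remark on per-cluster learning rates can be checked numerically. Reading
the rate as $\eta_i = 1/N_i$ (an interpretation of the text above, with $N_i$
counted up to and including the current sample), the winner update (8.9) turns
each prototype into the running mean of the samples it has won. The sketch
below assumes NumPy; the function name `wta_running_mean` and its return values
are hypothetical.

```python
import numpy as np

def wta_running_mean(X, C0):
    """Simple competitive learning (8.9)-(8.10) with eta_i = 1/N_i."""
    C = np.array(C0, dtype=float)
    counts = np.zeros(len(C), dtype=int)    # N_i: samples won by prototype i
    labels = np.empty(len(X), dtype=int)    # winner index for each sample
    for t, x in enumerate(X):
        w = np.argmin(np.linalg.norm(C - x, axis=1))  # winner-takes-all
        counts[w] += 1
        C[w] += (x - C[w]) / counts[w]      # eta_w = 1/N_w; losers unchanged
        labels[t] = w
    return C, counts, labels
```

After one pass, every prototype that has won at least once equals the mean of
the samples assigned to it, because the first update (with $N_i = 1$)
overwrites the initial value and each later update is the standard incremental
mean.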

Winner-kill-loser is another rule for competitive learning, and has been applied 
to the neocognitron [45]. Each time a training sample is presented, non-silent
cells compete with each other. The winner not only takes all, but also 
kills losers. In other words, the winner learns the training sample, and losers
are removed from the network. If all cells are silent, a new cell is generated
and it learns the training sample. 


Example 8.1: For the iris data set, we use the simple competitive learning model 
to classify the data set. We set the number of training epochs to 100, and the 
learning rate to 0.3. The original classification and the output of the neural 
network classification are shown in Fig. 8.3. The rate of correct classification is 
89.33%. 


Figure 8.3 Classification result for the iris data set. 


Self-organizing maps 


The different regions of the cerebral cortex respond to different sensory inputs 
(e.g. visual-, auditory-, motor-, or somato-sensory), and topographically ordered 
mappings are widely observed in the cortex. A cytoarchitectural map of the 
cerebral cortex is presented in Fig. 8.4 [15]. The primary sensory regions of 
the cortical maps are established genetically in a predetermined manner, and 
more detailed associative areas between the primary sensory areas are gradually 
developed through topographical self-organization during life [69]. There exist 
two main types of brain maps [75]: pointwise ordered projections from a receptive 
surface onto a cortical area (e.g., the somatotopic and visual maps), and abstract 
or computational maps, which are ordered along with some sensory feature value 
or a computed entity (e.g., the color map in area 4 of the visual cortex and the 
target-range map in the mustache bat auditory cortex). 

Von der Malsburg’s line-detector model [111] and Kohonen’s SOM [67] are 
two well-known topology-preserving competitive learning models. Both are 
abstract or computational maps. The line-detector model is based on fixed
excitatory and inhibitory lateral connections and the Hebbian rule of synaptic
plasticity of the afferent connections; however, the natural signal patterns are usually 
more complex. SOM models the sensory-to-cortex mapping, and is an unsuper- 
vised, associative memory mechanism. In [75], a pointwise-ordered projection 
from the input layer to the output layer is created in a self-organized fashion 
relating to SOM. If the input layer consists of feature detectors, the output layer 
forms a feature map of the inputs. 

SOM is well known for its ability to perform clustering while preserving topol- 
ogy. It compresses information while preserving the most important topological 
and metric relationships of the primary data elements. SOM can be regarded as 
competitive learning with a topological constraint. It is useful for vector quantiza- 
tion, clustering, feature extraction, and data visualization. The Kohonen learning 
rule is a major development of competitive learning. 
Figure 8.4 Cytoarchitectural map of the cerebral cortex. The different areas of
the cerebral cortex have different layer thickness and types of cells. Some of
the sensory areas are motor cortex (area 4), premotor area (area 6), frontal eye
fields (area 8), somatosensory cortex (areas 1, 2, 3), visual cortex (areas 17,
18, 19), and auditory cortex (areas 41, 42). © A. Brodal, 1981, Oxford
University Press [15]. 

Figure 8.5 Architecture of the two-dimensional J-K Kohonen network. 


8.2.1 Kohonen network 


The Kohonen network is a J-K feedforward structure with fully interconnected 
processing units that compete for signals. The output layer is called the Kohonen 
layer. Input nodes are fully connected to output neurons with their associated 
weights. Lateral connections between neurons are used as a form of feedback 
whose magnitude is dependent on the lateral distance from a specific neuron, 
which is characterized by a neighborhood parameter. 

The Kohonen network defined on $\mathbb{R}^n$ is a one-, two-, or
higher-dimensional grid $A$ of neurons characterized by prototypes
$c_k \in \mathbb{R}^n$ [68, 69]. $c_k$ can also be viewed as the weight vector
to neuron $k$. The architecture of the network in the two-dimensional space is
illustrated in Fig. 8.5. 
The Kohonen network uses competitive learning. Patterns are presented 
sequentially in time through the input layer, without specifying the desired out- 
put. The Kohonen network is extended to SOM when the lateral feedback is 
more sophisticated than the WTA rule. For example, the lateral feedback used 
in SOM can be selected as the so-called Mexican hat function, which is observed 
in the visual cortex. 


Basic self-organizing maps 


SOM employs the Kohonen network topology. An SOM not only categorizes the 
input data, but also recognizes which input patterns are near one another 
in stimulus space. For each neuron $k$, compute the Euclidean distance to input 
pattern $x_t$, and find the neuron whose prototype is closest to $x_t$: 

$$ \|x_t - c_w\| = \min_k \|x_t - c_k\|, \tag{8.11} $$

where subscript $w$ denotes the winning neuron, called the excitation center, 
which becomes the center of a group of input vectors that lie closest to $c_w$. 
For all the input vectors closest to $c_w$, update all the prototype vectors by 

$$ c_k(t+1) = c_k(t) + \eta(t) \, h_{kw}(t) \, [x_t - c_k(t)], \quad k = 1, \ldots, K, \tag{8.12} $$

where $\eta(t)$ is selected according to (8.4), and $h_{kw}(t)$ is the so-called
excitation response or neighborhood function, which defines the response of
neuron $k$ when $c_w$ is the excitation center. Equation (8.12) is known as the
Kohonen learning rule [69]. 

If $h_{kw}(t) = 1$ for $k = w$ and 0 otherwise, (8.12) reduces to simple
competitive learning. $h_{kw}(t)$ can be selected as a function that decreases
with an increasing distance between $c_k$ and $c_w$, and is typically selected
as the Gaussian function 

$$ h_{kw}(t) = h_0 \, e^{-\|c_k - c_w\|^2 / \sigma^2(t)}, \tag{8.13} $$

where $h_0 > 0$ is a constant. In SOM, the topological neighborhood shrinks with 
time, thus $\sigma(t)$ is a decreasing function of $t$, and a popular choice is
the exponential decay with time [97] 

$$ \sigma(t) = \sigma_0 \, e^{-t/\tau}, \tag{8.14} $$

where $\sigma_0$ is a positive constant and $\tau$ is a time constant. 
Another popular neighborhood function is the Mexican-hat function. A 
Mexican-hat function is well-suited to bipolar stimuli and is described by 


1 lep-cwl? lep-ewl? 
hiw) = z% (se “7 eat ) ; (8.15) 


where $\sigma(t)$ is defined by (8.14). The Gaussian and Mexican-hat functions
are plotted in Fig. 8.6. 
Algorithm 8.1 (SOM). 

1. Set t = 0. 
2. Initialize all $c_k(0)$ and learning parameters $\eta(0)$, $h_0$, $\sigma_0$, and $\tau$. 
3. Repeat until a criterion is satisfied: 
   a. Present pattern $x_t$ at time t. 
   b. Select the winning neuron for $x_t$ by (8.11). 
   c. Update the prototypes for all neurons by (8.12). 
   d. Set t = t + 1. 





The Gaussian topological neighborhood is biologically more reasonable than a 
rectangular one. SOM using the Gaussian neighborhood converges more quickly 
than SOM using a rectangular one [87]. SOM is given by Algorithm 8.1. 
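
The steps of Algorithm 8.1 can be sketched in Python with NumPy as follows.
This is an illustrative sketch under several assumptions that are not taken
from the text: the neurons sit on a rectangular `rows x cols` lattice, the
Gaussian neighborhood $h_0 e^{-d^2/\sigma^2(t)}$ uses the distance $d$ between
lattice positions (a common convention), the learning rate decays linearly, and
the default `tau = T / 3` is arbitrary. The name `train_som` is hypothetical.

```python
import numpy as np

def train_som(X, rows, cols, T, eta0=0.5, sigma0=2.0, tau=None, h0=1.0, seed=0):
    """Online SOM: winner search (8.11) followed by the Kohonen rule (8.12)."""
    rng = np.random.default_rng(seed)
    tau = T / 3.0 if tau is None else tau   # assumed default time constant
    # Lattice coordinates of the K = rows * cols neurons (assumed layout).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Step 2: initialize prototypes from randomly chosen training patterns.
    C = X[rng.integers(len(X), size=rows * cols)].astype(float)
    for t in range(T):                                    # step 3
        x = X[rng.integers(len(X))]                       # 3a: present a pattern
        w = np.argmin(np.linalg.norm(C - x, axis=1))      # 3b: winner, (8.11)
        sigma = sigma0 * np.exp(-t / tau)                 # shrinking width, (8.14)
        d2 = ((grid - grid[w]) ** 2).sum(axis=1)          # squared lattice distance
        h = h0 * np.exp(-d2 / sigma ** 2)                 # Gaussian neighborhood
        eta = eta0 * (1.0 - t / T)                        # decreasing learning rate
        C += eta * h[:, None] * (x - C)                   # 3c: update all neurons, (8.12)
    return C, grid
```

Because `eta * h` never exceeds 1 with these defaults, every update is a convex
combination of the old prototype and the input, so the prototypes stay inside
the range of the data.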

The algorithm can be stopped when the map achieves an equilibrium with 
a given accuracy or when a specified number of iterations is reached. In the 
convergence phase, $h_{wk}$ can be selected as time invariant, and each
prototype is recommended to be updated by using an individual learning rate
$\eta_k$ [73] 

$$ \eta_k(t+1) = \frac{\eta_k(t)}{1 + h_{wk} \, \eta_k(t)}. \tag{8.16} $$

Normalization of $x$ is suggested since the resulting reference vectors tend to
have the same dynamic range. This may improve the numerical accuracy [69]. 

After learning is completed, the network is ready for generalization. When a 
new pattern $x$ is presented to the map, the corresponding output $c$ is
determined according to the mapping $x \to c$ such that
$\|x - c\| = \min_{c_k \in A} \|x - c_k\|$. The mapping performs vector
quantization of the input space into the map $A$. 

Compared with the symmetric neighborhood function, an asymmetric neighborhood
function for SOM accelerates the ordering process of SOM [4], though 
this asymmetry tends to distort the generated ordered map. The number of 
learning steps required for perfect ordering in the case of the one-dimensional 
SOM is numerically shown to be reduced from $O(N^3)$ to $O(N^2)$ with an
asymmetric neighborhood function, even when the improved algorithm is used to
get the final map without distortion. 

SOM is deemed to converge to an organized configuration in one- or higher- 
dimensional SOM with probability one. In the literature, there are some proofs 
for the convergence of one-dimensional SOM based on Markov-chain analysis 
[41]. However, no general proof of convergence for multidimensional SOM is 
available. SOM suffers from several major problems such as forced termination, 
unguaranteed convergence, nonoptimized procedures, and the output being often 
dependent on the sequence of data. SOM is not derived from any known objective 
function, and its termination is not based on optimizing any model of the process 
or its data. It is closely related to C-means clustering [86]. SOM is shown to be an 
asymptotically optimal vector quantization [120]. With neighborhood learning, 
it is an error tolerant vector quantization [88] and a Bayesian vector quantization 
[89]. 

SOM with dynamic learning [26] improves SOM training on signals with sparse 
events which allows for more representative prototype vectors to be found, and 
consequently better signal reconstruction. The training rule is given by [26] 


$$ c_k(t+1) = c_k(t) + \eta(t) \, h_{kw}(t) \, \mathrm{sgn}(x_t - c_k(t)) \, \|x_t - c_k(t)\|^2, \quad k = 1, \ldots, K. \tag{8.17} $$

Parameterless SOM [10] calculates the learning rate and neighborhood size 
based on the local quadratic fitting error of the map to the input space. This 
allows the map to make large adjustments in response to unfamiliar inputs, 
while making small changes in response to inputs it is already well adjusted 
to. It markedly decreases the number of iterations required to get a stable and 
ordered map. Parameterless SOM is measurably less ordered than a properly 
tuned SOM and edge shrinking is also more marked in parameterless SOM. It is 
able to handle input probability distributions that lead to failure of SOM. It is 
guaranteed to achieve ordering under certain conditions. 

Like the classical vector quantization method, SOM was originally intended to 
approximate input signals or their pdfs by quantified codebook vectors that are 
localized in the input space to minimize a quantization error functional [69]. SOM 
is related to adaptive C-means, but performs a topological feature map that is 
more complex than just cluster analysis. The topology-preservation property 
makes SOM a popular choice in data analysis. However, SOM is not a good 
choice in terms of clustering performance compared to other popular clustering 
algorithms such as C-means [90], neural gas [92] and ART 2A [57]. Besides, for 
large output dimensions, the number of nodes in the adaptive SOM grid increases 
exponentially with the number of function parameters. The prespecified standard 
grid topology may not be able to match the structure of the distribution, and 
can thus lead to poor topological mappings. 

When $M^2$ is the size of a feature map, the number of compared weight vectors 
for one input vector to search a winner vector by exhaustive search is
equivalent to $M^2$. In [78], the proposed SOM algorithm with $O(\log_2 M)$
complexity is composed of a subdividing method and a binary search method. Only
winner vectors are trained. The algorithm subdivides the map repeatedly, and new 
nodes of weight vectors emerge in every step. 

Figure 8.7 Random data points in two-dimensional space. In each of the two
quarters, there are 1000 uniformly random points. 

Complex-valued SOM performs adaptive clustering of the feature vector in the 
complex-amplitude space [53]. 


Example 8.2: We implement vector quantization using SOM with a grid of cells. 
The data set is composed of 1500 random data points in the two-dimensional 
space: 500 uniformly random points in each of the three unit squares, as shown 
in Fig. 8.7. 

The link distance¹ is employed. All prototypes of the cells are initialized at
the center of the range of the data set, namely, (1.5, 1.5). The ordering phase
starts from a learning rate of 0.9, which decreases to the tuning-phase learning
rate of 0.02 in 1000 epochs; the tuning phase then lasts much longer, with a
slowly decreasing learning rate. In the tuning phase, the neighborhood distance
is set to 1. When training is completed, three points, $p_1 = (0.8, 0.6)$,
$p_2 = (1.5, 2.8)$ and $p_3 = (2.1, 1.2)$, are used as test points. 

In the first group of simulations, the output cells are arranged in a 10 x 10 
grid. The hexagonal neighborhood topology is employed. The training results for 
10, 100, 1000 and 5000 epochs are shown in Fig. 8.8. At 5000 epochs, we tested 
$p_1$, $p_2$ and $p_3$, and found that they, respectively, belong to the 25th,
93rd and 27th clusters. 

Figure 8.8 Two-dimensional SOM. 

1 The link distance between two points A and B inside a polygon P is defined to
be the minimum number of edges required to connect A and B inside P. 

In the second group of simulations, the output cells are arranged in a
one-dimensional grid of 100 nodes. The corresponding results are shown in
Fig. 8.9. In this case, $p_1$, $p_2$ and $p_3$, respectively, belong to the
68th, 61st and 48th clusters. 


Example 8.3: SOM can be applied to solve the TSP [42]. The process results in 
a neural encoding that gradually relaxes toward a valid tour. Assume that 40 
cities are randomly located in a unit square. The objective is to find the shortest 
route that passes through all the cities, each city being visited exactly once. No 
constraint is applied on the Kohonen network since the topology of the solution 
is contained in the network topology. A one-dimensional grid of 80 units is used 
by the SOM. The desired solution is that all the cities are covered by nodes, 
and all the additional nodes are along the lines between cities. The Euclidean 
distance is employed. Other parameters are the same as those for Example 8.2. 
The search results at the 10th, 100th, 1,000th and 10,000th epochs are illustrated 
in Fig. 8.10, and the total map length at the 10,000th epoch is 5.2464. 
Figure 8.9 One-dimensional SOM. The data points $p_1$, $p_2$ and $p_3$ are denoted by plus (+) signs. 


It is seen that the results from SOM are not satisfactory, and the routes are 
not feasible solutions since some cities are not covered by nodes. Nevertheless, 
SOM can be used as a preliminary search for a suboptimal route, which can 
be modified manually to obtain a feasible solution. The SOM solution can be 
used as an initialization of other TSP solvers. In this case, we do not need to 
run SOM for many epochs. 

Classical SOM is not efficient for finding suboptimal solutions to the TSP. 
Many practical TSP solvers have been developed using self-organizing neural- 
network models based on SOM, among which some solvers can find a suboptimal 
solution for a TSP of hundred cities within dozens of epochs. Most of them are 
based on the concept of elastic ring [61]. A large-scale TSP can be rapidly solved 
by a divide-and-conquer technique, where clustering methods are first used to 
group the cities and a local optimization algorithm is used to find the minimum 
in each group [96]. This speedup is offset by a slight loss in tour quality, but the 
structure is suitable for parallel implementation. 
Figure 8.10 The TSP using one-dimensional SOM. Circles denote positions of the cities. 


Batch-mode SOM 

In the batch SOM algorithm [71], the prototypes are updated once for each 
epoch: 

$$ c_k(t_f) = \frac{\sum_{t=t_0}^{t_f} h_{wk}(t) \, x_t}{\sum_{t=t_0}^{t_f} h_{wk}(t)}, \tag{8.18} $$

where $t_0$ and $t_f$ denote the start and finish of the present epoch,
respectively, and $c_k(t_f)$ are the prototype vectors computed at the end of
the present epoch. The winning node at each presentation is computed using 

$$ d_w(t) = \min_k \|x_t - c_k(t_0)\|, \tag{8.19} $$

where $c_k(t_0)$ are the prototype vectors computed at the end of the previous 
epoch. The neighborhood functions $h_{wk}(t)$ are computed from (8.13), but with 
the winning nodes determined from (8.19). 
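
One epoch of the batch update can be sketched with NumPy as a pair of array
operations. The sketch is illustrative only: it assumes a Gaussian neighborhood
of the form $h_0 e^{-d^2/\sigma^2}$ with $d$ measured between lattice positions
held in a hypothetical `grid` array, a width `sigma` kept fixed within the
epoch, and a function name `batch_som_epoch` that does not come from the text.

```python
import numpy as np

def batch_som_epoch(X, C, grid, sigma, h0=1.0):
    """One batch SOM epoch: winners from (8.19), weighted means from (8.18)."""
    # Winners are found against the prototypes frozen at the start of the epoch.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)      # shape (N, K)
    winners = d2.argmin(axis=1)
    # Neighborhood weight of neuron k for each sample, centered on that sample's winner.
    g2 = ((grid[winners][:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = h0 * np.exp(-g2 / sigma ** 2)                            # shape (N, K)
    # Each new prototype is the neighborhood-weighted mean of all inputs.
    return (H.T @ X) / H.sum(axis=0)[:, None]
```

Since the result is a ratio of sums over the whole epoch, permuting the rows of
`X` leaves it unchanged up to rounding, and no learning rate enters the update.
With a very small `sigma` the weights of non-winning neurons can underflow to
zero, so a neuron that never wins would need separate handling.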

Compared with the conventional online SOM method, batch SOM offers no 
dependence upon the order in which the input records are presented. In addition 
to facilitating the development of data-partitioned parallel methods, this also 
eliminates concerns that input records encountered later in the training sequence 
may overly influence the final results. The learning rate coefficient $\eta(t)$
does not appear in batch SOM. Batch C-means and batch SOM optimize the same cost 
functions as their online variants [22]. The batch training algorithm is generally 
much faster than the incremental algorithm. 


Adaptive-subspace SOM (ASSOM) 

Adaptive-subspace SOM (ASSOM) [72, 73] is a modular neural network model 
comprising an array of topologically ordered SOM submodels. ASSOM creates 
a set of local subspace representations by competitive selection and cooperative 
learning. Each submodel is responsible for describing a specific region of the 
input space by its local principal subspace, and represents a manifold such as 
a linear subspace with small dimensionality, whose basis vectors are determined 
adaptively. ASSOM not only inherits the topological representation property of 
SOM, but provides learning results that reasonably describe the kernels of various 
transformation groups like PCA. ASSOM is used to learn a number of invariant 
features, usually pieces of elementary one- or two-dimensional waveforms with 
different frequencies called wavelets, independent of their phases. Two fast imple- 
mentations of ASSOM are proposed in [123] based on the basis rotation operator 
of ASSOM. 


Learning vector quantization 


LVQ [68, 69] is a widely used approach to classification. LVQ employs exactly 
the same network architecture as the Kohonen network with the exception that 
each output neuron is assigned a class membership and no assumption is 
made concerning the topological structure. The LVQ network is associated with 
the two-layer competitive learning network shown in Fig. 8.2. 

LVQ is based on the known classification of feature vectors, and can be treated 
as a supervised version of SOM. It is used for vector quantization and classifi- 
cation, as well as for fine-tuning of SOM. LVQ algorithms define near-optimal 
decision borders between classes, even in the sense of classical Bayesian decision 
theory. 

LVQ minimizes the functional (8.5), where $\mu_{kp} = 1$ if neuron $k$ is the
winner and zero otherwise, when pattern pair $p$ is presented. LVQ works on a
set of $N$ pattern pairs $(x_p, y_p)$, where $x_p \in \mathbb{R}^J$ is the input
vector and $y_p \in \mathbb{R}^K$ is the binary target vector that codes the
class membership, that is, only one entry of $y_p$ takes value unity, while all
its other entries are zero. Kohonen proposed a family of LVQ algorithms
including LVQ1, LVQ2 and LVQ3 [69]. Assuming that pattern $p$ is presented at
time $t$, LVQ1 is given as 

$$ \begin{aligned}
c_w(t+1) &= c_w(t) + \eta(t) [x_t - c_w(t)], \quad y_{p,w} = 1, \\
c_w(t+1) &= c_w(t) - \eta(t) [x_t - c_w(t)], \quad y_{p,w} = 0, \\
c_i(t+1) &= c_i(t), \quad i \neq w,
\end{aligned} \tag{8.20} $$

where $w$ is the index of the winning neuron, $x_t = x_p$, $y_{p,w} = 1$ and 0
represent the cases of correct and incorrect classifications of $x_t$,
respectively, and $\eta(t)$ is defined as in earlier formulations. When it is
used to fine-tune SOM, one can start with a small $\eta(0)$, usually less than
0.1. LVQ1 tends to reduce the point density of $c_i$ around the Bayesian
decision surfaces. 

Algorithm 8.2 (LVQ1). 

1. Set t = 0. 
2. Initialize all $c_k(0)$ and $\eta(0)$. 
3. Repeat until a criterion is satisfied: 
   a. Present pattern $x_t$. 
   b. Select the winning neuron for $x_t$ by (8.1). 
   c. Update the prototypes for all neurons by (8.20). 
   d. Decrease $\eta(t)$. 
   e. Set t = t + 1. 

LVQ1 can be considered a modified version of online C-means in which class 
labels affect the way that the clustering process is performed and online gradient 
descent is used over a cost function [11]. LVQ1 is given by Algorithm 8.2. 
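
A minimal Python sketch of LVQ1 follows, assuming NumPy. It is not the
implementation used for the examples in this chapter: the names `train_lvq1`
and `predict_lvq`, the per-epoch shuffling, the linearly decreasing rate, and
the representation of class membership by an integer label per prototype
(`c_labels`) instead of a binary target vector are assumptions made for the
illustration.

```python
import numpy as np

def train_lvq1(X, y, C, c_labels, epochs=50, eta0=0.1, seed=0):
    """LVQ1: attract the winner if its class matches the sample, repel it otherwise."""
    rng = np.random.default_rng(seed)
    C = np.array(C, dtype=float)
    T = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for p in rng.permutation(len(X)):
            x = X[p]
            w = np.argmin(np.linalg.norm(C - x, axis=1))  # winner by (8.1)
            eta = eta0 * (1.0 - t / T)                    # decreasing learning rate
            sign = 1.0 if c_labels[w] == y[p] else -1.0   # correct vs. incorrect class
            C[w] += sign * eta * (x - C[w])               # only the winner is updated
            t += 1
    return C

def predict_lvq(X, C, c_labels):
    """Nearest-prototype classification with the trained codebook."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.asarray(c_labels)[d2.argmin(axis=1)]
```

A typical call initializes `C` with at least one training vector per class,
trains with `train_lvq1`, and then labels new points with `predict_lvq`.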

OLVQ1 is an optimized version of LVQ1. In OLVQ1, each codebook vector $c_i$ 
is assigned an individual adaptive learning rate [74] 

$$ \eta_i(t) = \frac{\eta_i(t-1)}{1 + s(t) \, \eta_i(t-1)}, \tag{8.21} $$

where $s(t) = +1$ for correct classification and $s(t) = -1$ for wrong
classification. Since $\eta_i(t)$ may increase, it should be limited to be less
than 1. One can restrict $\eta_i(t) < \eta_i(0)$, and set $\eta_i(0) = 0.3$. The
convergence of OLVQ1 may be up to one order of magnitude faster than that of
LVQ1. 
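
The recursion (8.21) is easy to explore numerically. The helper below is a
small sketch: the name `olvq1_rate` and the clipping with `min` (which enforces
"not larger than the initial rate" rather than a strict inequality) are
assumptions made for the illustration.

```python
def olvq1_rate(eta_prev, correct, eta_max=0.3):
    """One step of the OLVQ1 learning-rate recursion (8.21), capped at eta_max."""
    s = 1.0 if correct else -1.0             # s(t) = +1 correct, -1 wrong
    # Assumes eta_prev < 1, so the denominator stays positive when s = -1.
    eta = eta_prev / (1.0 + s * eta_prev)
    return min(eta, eta_max)                 # keep the rate from growing past its cap
```

For a run of $n$ consecutive correct classifications starting from $\eta_i(0)$,
the recursion gives $1/\eta_i(n) = 1/\eta_i(0) + n$, so the rate decays like
$1/n$; a wrong classification pushes the rate back up until the cap takes
effect.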

LVQ2 and LVQ3 comply better with the Bayesian decision surface. In LVQ1, 
only one codebook vector $c_i$ is updated at each step, while LVQ2 and LVQ3 
change two codebook vectors simultaneously. Different LVQ algorithms can be 
combined in the clustering process. However, both LVQ2 and LVQ3 have the 
problem of reference vector divergence [106]. In a generalization of LVQ2 [106], 
this problem is eliminated by applying gradient descent on a nonlinear cost 
function. 

Addition of training counters to individual neurons of LVQ can effectively 
record the training statistics of LVQ [98]. This allows for dynamic self-allocation 
of the neurons to classes. At the generalization stage, these counters provide 
an estimate of the reliability of classification of the individual neurons. The 
method turns out to be especially valuable in handling strongly overlapping 
class distributions in pattern space. 
Figure 8.11 Classification using LVQ1. (a) The data set. (b) The classification result. (c) MSE.




















Example 8.4: We generate 50 data points of two classes that are nonlinearly
separable. An LVQ network can solve this problem with no difficulty. LVQ1 is used in
this example. We set the number of training epochs to 200 and the learning rate 
to 0.02. The original classification and the output of the LVQ network are shown 
in Fig. 8.11. For this trial, a training MSE of 0 is achieved, and the classification 
for 10 test points generates reasonable results. 


LVQ and its variants are purely heuristically motivated local learners for adap- 
tive nearest prototype classification. They suffer from the problem of instabilities 
for overlapping classes. They are sensitive to the initialization of prototypes, and 
are restricted to classification scenarios in Euclidean space. Generalized rele- 
vance LVQ copes with these problems by integrating neighborhood cooperation 
to deal with local optima [52]. It shows very robust behavior. It obeys gradient 
dynamics, and the chosen objective is related to margin optimization. 

Using concepts from statistical physics and online learning, a mathematical
framework is presented in [48], [14] to analyze the performance of different LVQ
algorithms including LVQ1, LVQ2.1 [70], and learning-from-mistakes in terms
of their dynamics, sensitivity to initial conditions, and generalization ability.
LVQ1 shows near optimal asymptotic generalization error for all choices of the
prior distribution in the equal class variance case, independent of initialization.
Learning-from-mistakes is a crisp version of robust soft LVQ [108]. A global
cost function is lacking for LVQ1, whereas a cost function is available for a soft
version, as it is for LVQ2.1 and learning-from-mistakes. Soft LVQ algorithms [108]
are derived from an objective function based on a likelihood ratio using gradient
descent, leading to better classification performance than LVQ2.1. The behavior
of LVQ2.1 is unstable.


Nearest-neighbor algorithms 


The nearest-neighbor rule [30] is a widely used learning algorithm. The k-NN
approach is statistically inspired in the estimation of the posterior probability
$p(H_i|\boldsymbol{x})$ of the hypothesis $H_i$, conditioned on an observation point $\boldsymbol{x}$. It is among
the simplest and most successful nonparametric classification methods [58].

A k-NN classifier classifies an input by identifying the k examples with the
closest inputs and assigning the class label from the majority of those examples.
The algorithm is simple to implement, works fast for small training sets, and its
performance asymptotically approaches that of the Bayes’ classifier. The Parzen
window approach has the drawback that the estimate is very sensitive to the choice
of cell size, while k-NN solves this by letting the cell volume be a function
of the data. The classification performance of k-NN varies significantly with
k. Therefore, the optimal value of k is usually found by a trial-and-error
procedure. In k-NN, all neighbors receive equal importance.

k-NN is also used for outlier detection. All training patterns are used as proto- 
types and an input pattern is assigned to the class with the closest prototype. It 
generalizes well for large training sets, and the training set can be extended at any 
time. The nearest-neighbor classifier is a local learning system: it fits the train- 
ing data only in a region around the location of an input pattern. It converges 
to Bayes’ classifier as the number of neighbors k and the number of prototypes 
M tend to infinity at an appropriate rate for all distributions. The theoretical 
asymptotic classification error is upper-bounded by twice Bayes’ error. However,
it requires the storage of the whole training set which may be excessive for large
data sets, and has a computational complexity of $O(N^2)$ for a set of $N$ patterns.
It also takes a long time for recall. Thus, k-NN is impractical for large training
sets. To classify pattern $\boldsymbol{x}$, k-NN is given by Algorithm 8.3.

By sorting the data with respect to each attribute as well as using complex 
data structures such as kd-trees, significant gain in performance can be obtained. 
By replacing the sort operation with the calculation of the order statistics, the 
k-NN method can further be improved in speed and its stability with respect to 
the order of presentation of the data [8]. 
Algorithm 8.3 (k-NN).

1. Find the k nearest patterns to $\boldsymbol{x}$ in the set of prototypes
   $P = \{(\boldsymbol{m}_j, cl(\boldsymbol{m}_j)),\ j = 0, \ldots, M - 1\}$, where $\boldsymbol{m}_j$ is a prototype
   that belongs to one of the M classes and $cl(\boldsymbol{m}_j)$ is the class
   indicator variable.
2. Classify by a majority vote amongst these k patterns.
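
The two steps of Algorithm 8.3 can be sketched in a few lines of Python. This is only an illustrative brute-force version under simple assumptions: the prototype set is a list of `(vector, label)` pairs, the Euclidean distance is used, and the function name `knn_classify` is hypothetical. It evaluates all distances for every query, which is exactly the recall cost discussed above.

```python
import math
from collections import Counter


def knn_classify(prototypes, x, k):
    """Classify x by a majority vote among its k nearest prototypes.

    prototypes: list of (vector, label) pairs, i.e. the set P of Algorithm 8.3.
    """
    # Step 1: find the k prototypes nearest to x (brute-force sort).
    nearest = sorted(prototypes, key=lambda p: math.dist(p[0], x))[:k]
    # Step 2: majority vote over the labels of these k prototypes.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With `k = 1` the function reduces to the nearest-neighbor rule; larger `k` smooths the decision boundary at the cost of more bias.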





Nearest-neighbor classification is a lazy learning method because training data 
is not preprocessed in any way. The class assigned to a pattern is the class of the 
nearest pattern known to the system, measured in terms of a distance defined 
on the feature space. On this space, each pattern defines its Voronoi region. For 
the Euclidean distance, Voronoi regions are delimited by linear borders. In prac- 
tice, k = 1 is a common choice, since the Euclidean 1-NN classifier forms class 
boundaries with piecewise linear hyperplanes and any border can be approxi- 
mated by a series of locally defined hyperplanes. Due to using only the training 
point closest to the query point, the bias of the 1-NN estimate is often low, but 
the variance is high. Asymptotically, the error rate of the 1-NN classifier is never 
more than twice the Bayes’ rate [30]. Considering a larger number of codebook 
vectors close to an input sample may lead to lower error rates than using the 
nearest prototype only, thus the k-NN rule usually outperforms the 1-NN rule. 
PAC error bounds for k-NN classifiers are $O(N^{-2/5})$ for a training set of size $N$
[7].

The set of prototypes P is computed from training data. A simple method 
is to select the whole training set as P, but this results in large memory and 
execution requirements for large databases. Therefore, in practice, a small set of 
prototypes of size M is mandatory. To improve over 1-NN classification, more 
than one neighbor may be used to determine the class of a pattern (k-NN), 
or distances other than the Euclidean may be used. A further refinement in 
nearest-neighbor classification is replacing the original training data by a set 
of prototypes that correctly represent it. Thus classification of new patterns is 
performed much faster. Besides, these nearest-prototype algorithms improve the 
accuracy of basic nearest-neighbor classifiers. 


Example 8.5: By using STPRtool (http://cmp.felk.cvut.cz/cmp/software/stprtool/),
we create the k-NN classification rule using Ripley's data set for
training and testing. The training data and the decision boundary are plotted
in Fig. 8.12. The testing error is 0.15 for k = 1, 0.127 for k = 4, and 0.082 for
k = 16.

Figure 8.12 The decision boundary of the k-NN classifier: (a) k = 4, (b) k = 16.


A set of unlabelled prototypes can be first obtained from training data by 
clustering. These prototypes can then be used to divide the input space in k-NN 
cells. Finally, labels are assigned to prototypes according to a majority vote by 
the training data in each cell. However, a one-step learning strategy such as LVQ 
is more efficient to compute labelled centroids. 

LVQ1 does not minimize the classification error. In [11], LVQ1 is generalized for 
nearest-neighbor classifiers. It is based on a regularizing parameter that mono- 
tonically decreases the upper bound of the training classification error towards a 
minimum. LVQ1 places prototypes around Bayes’ borders and consequently the 
resulting nearest-neighbor classifier estimates Bayes’ classifier. The regularized 
LVQ1 algorithm improves the classification rate of LVQ1 [11]. 

There are two procedures for reducing the number of prototypes. Editing pro- 
cesses the training set to increase generalization capabilities by removing proto- 
types that contribute to the misclassification rate, for example, removing outlier 
patterns or removing patterns that are surrounded mostly by others of different 
classes [112]. Condensing is to obtain a small template that is a subset of the 
training set without changing the nearest-neighbor decision boundary substan- 
tially. This can be established by reducing the number of prototypes that are 
centered in dense areas of the same class [54], [115]. Improved k-NN [115] reduces 
the training set required for k-NN while maintaining the same level of classifica- 
tion accuracy. This is implemented by iteratively eliminating patterns with high 
attractive capacities. The algorithm filters out a large portion of prototypes that 
are unlikely to match against the unknown pattern. The condensing algorithm 
for k-NN, namely the template reduction for k-NN [40], drops patterns that are 
far away from the boundary. A chain is defined as a sequence of nearest neigh- 
bors from alternating classes. Patterns further down the chain are close to the 
classification boundary. 
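
The editing idea described above, removing patterns that are surrounded mostly by patterns of other classes, can be sketched as follows. This is a simplified illustration and not the specific procedure of [112]: the function name `edit_training_set`, the leave-one-out neighborhood, and the threshold "at least half of the k neighbors share the label" are all assumptions.

```python
import math


def edit_training_set(patterns, k=3):
    """Keep a pattern only if at least half of its k nearest neighbors agree.

    patterns: list of (vector, label) pairs. Returns the edited list.
    """
    kept = []
    for i, (x, label) in enumerate(patterns):
        # Neighbors are searched among all other patterns (leave-one-out).
        others = patterns[:i] + patterns[i + 1:]
        neighbors = sorted(others, key=lambda p: math.dist(p[0], x))[:k]
        agree = sum(1 for _, other_label in neighbors if other_label == label)
        if 2 * agree >= len(neighbors):
            kept.append((x, label))
    return kept
```

The edited set can then be used as the prototype set of a nearest-neighbor classifier; condensing would be applied afterwards to shrink it further.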

Learning k-NN rules [79] is suggested as a refinement of LVQ1. LVQ1 performs
better than any of the learning k-NN rules if the codebook size is rather small.
However, those extensions to k neighbors achieve lower error rates than LVQ1
when the codebook size is substantially increased. These results agree with the
asymptotic theoretical analysis for the k-NN rule, in the sense that this can
reach the optimal classification accuracy only when a sufficiently large number
of codebook vectors is available.

A method proposed in [5] enhances the estimate of the posterior probabilities 
for pattern classification. It is based on observing the k nearest neighbors and 
weighting their contribution to the posterior probabilities differently. The weights 
are estimated using the ML procedure. 


Neural gas 


Neural gas [92] is a vector quantization model that minimizes a known cost func- 
tion and converges to the C-means quantization error via a soft-to-hard compet- 
itive model transition. The soft-to-hard annealing process helps the algorithm 
to escape from local minima. Neural gas is a topology-preserving network. It is 
particularly useful for three-dimensional (3-D) reconstruction. It can be treated 
as an extension to C-means. It has a fixed number K of processing units with 
no lateral connection. 

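The goal of neural gas is to find prototypes $\boldsymbol{c}_i \in R^n$, $i = 1, \ldots, K$, such that
these prototypes represent the underlying distribution $P$ as accurately as possible,
minimizing the cost function [92]:

$$E_{NG}(\boldsymbol{c}_i) = \frac{1}{2C(\lambda)} \sum_{i=1}^{K} \int h_\lambda\left(r_i(\boldsymbol{x}, \boldsymbol{c}_i)\right) d(\boldsymbol{x}, \boldsymbol{c}_i)\, P(d\boldsymbol{x}), \tag{8.22}$$

where $d(\cdot,\cdot)$ denotes the Euclidean distance, $r_i(\boldsymbol{x}, \boldsymbol{c}_i) = |\{\boldsymbol{c}_j \,|\, d(\boldsymbol{x}, \boldsymbol{c}_j) < d(\boldsymbol{x}, \boldsymbol{c}_i)\}|$
is the rank of the prototypes sorted according to the distances, $h_\lambda(t) = e^{-t/\lambda}$ with
$\lambda > 0$ decreasing with time, and $C(\lambda) = \sum_{i=1}^{K} h_\lambda(r_i)$. The learning rule is derived
from gradient descent.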

A data optimal topological ordering is achieved by using neighborhood ranking
within the input space at each training step. To find its neighborhood rank, each
neuron compares its distance to the input vector with the distances of all the
other neurons to the input vector. Neighborhood ranking provides a training
strategy with mechanisms related to robust statistics, and neural gas does not
suffer from the prototype underutilization problem (see Sect. 9.1). At each step $t$,
the Euclidean distances between an input vector $\boldsymbol{x}_t$ and all the prototype vectors
$\boldsymbol{c}_k(t)$, $k = 1, \ldots, K$, are calculated by

$$d_k(\boldsymbol{x}_t) = \|\boldsymbol{x}_t - \boldsymbol{c}_k(t)\| \tag{8.23}$$

and $\boldsymbol{d}(t) = (d_1(\boldsymbol{x}_t), \ldots, d_K(\boldsymbol{x}_t))^T$. Each prototype $\boldsymbol{c}_k(t)$ is assigned a rank $r_k(t)$,
which takes an integer value from 0 to $K - 1$, with 0 for the smallest and $K - 1$
for the largest $d_k(\boldsymbol{x}_t)$.




244 Chapter 8. Clustering l: Basic clustering models and algorithms 


Algorithm 8.4 (Neural gas).

1. Initialize $K$, $\boldsymbol{c}_k$, $k = 1, \ldots, K$, $\rho_0$, $\eta_0$, $\rho_f$, $\eta_f$ and $T_f$.
2. Set $t = 1$.
3. Repeat until a stopping criterion is satisfied:
   a. Calculate distances $d_k(\boldsymbol{x}_t)$, $k = 1, \ldots, K$, by (8.23).
   b. Sort the components of $\boldsymbol{d}(t)$ and assign each prototype $\boldsymbol{c}_k$ a
      rank $r_k(t)$, which is a unique value from 0 to $K - 1$.
   c. Calculate $\eta(t)$, $\rho(t)$ by (8.25).
   d. Update $\boldsymbol{c}_k$, $k = 1, \ldots, K$, by (8.24).
   e. Set $t = t + 1$.





The prototypes are updated by

$$\boldsymbol{c}_k(t+1) = \boldsymbol{c}_k(t) + \eta(t)\, h\left(r_k(t)\right) \left(\boldsymbol{x}_t - \boldsymbol{c}_k(t)\right), \tag{8.24}$$

where $h(r) = e^{-r/\rho(t)}$ realizes soft competition, and $\rho(t)$ is the neighborhood width.
When $\rho(t) \to 0$, (8.24) reduces to the C-means update rule (8.34). During the
iterations, both $\rho(t)$ and $\eta(t)$ decrease exponentially from their initial positive
values

$$\eta(t) = \eta_0 \left(\frac{\eta_f}{\eta_0}\right)^{t/T_f}, \qquad \rho(t) = \rho_0 \left(\frac{\rho_f}{\rho_0}\right)^{t/T_f}, \tag{8.25}$$

where $\eta_0$ and $\rho_0$ are the initial decay parameters, $\eta_f$ and $\rho_f$ are the final decay
parameters, and $T_f$ is the maximum number of iterations.

The prototypes $\boldsymbol{c}_k$ are initialized by randomly assigning vectors from the training
set. Neural gas is given by Algorithm 8.4.
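
A compact Python sketch of Algorithm 8.4 is given below to make the interplay of ranking, (8.24), and (8.25) concrete. It is an illustration under stated assumptions rather than a reference implementation: the function name `neural_gas`, the default parameter values, the choice $\rho_0 = K/2$, and the uniform random presentation of patterns are all choices made here, and the stopping criterion is simply $t = T_f$.

```python
import math
import random


def neural_gas(data, K, eta0=0.5, eta_f=0.005, rho0=None, rho_f=0.01,
               T_f=2000, seed=0):
    """Online neural gas; data is a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialize prototypes with K distinct randomly chosen training vectors.
    protos = [list(x) for x in rng.sample(data, K)]
    if rho0 is None:
        rho0 = K / 2.0  # assumed default for the initial neighborhood width
    for t in range(1, T_f + 1):
        x = rng.choice(data)
        # Exponential decay of learning rate and neighborhood width, (8.25).
        frac = t / T_f
        eta = eta0 * (eta_f / eta0) ** frac
        rho = rho0 * (rho_f / rho0) ** frac
        # Distances (8.23) and neighborhood ranks 0 .. K-1.
        dists = [math.dist(x, c) for c in protos]
        order = sorted(range(K), key=lambda k: dists[k])
        # Rank-weighted update of every prototype, (8.24).
        for rank, k in enumerate(order):
            h = math.exp(-rank / rho)
            protos[k] = [c + eta * h * (a - c) for a, c in zip(x, protos[k])]
    return protos
```

Early in training all prototypes move noticeably toward each pattern; as $\rho(t)$ shrinks, only the rank-0 prototype is updated and the rule behaves like incremental C-means. The sort in each step is the source of the $O(K \log K)$ serial cost.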

Unlike SOM, neural gas determines a dynamical neighborhood relation as 
learning proceeds. Neural gas can be derived from a gradient-descent procedure 
on a potential function associated with the framework of fuzzy clustering. It is 
not sensitive to neuron initialization. Neural gas automatically determines a data 
optimum lattice, such that a small quantization error can be achieved. 

Neural gas converges faster to a smaller error E than C-means, maximum- 
entropy clustering [105] and SOM. This advantage is achieved at the price of a 
higher computational effort. In a serial implementation, the complexity for neu- 
ral gas is O(K log K) while the other three methods all have a complexity of 
O(K), where K is the number of prototypes. Nevertheless, in parallel implemen- 
tation all the four algorithms have the same complexity, O(log K) [92]. In a fast 
implementation of sequential neural gas [28], a truncated exponential function 
is used as the neighborhood function and neighborhood ranking is implemented 
without evaluating and sorting all the distances. Given the same quality of the 
resulting codebook, this fast realization gains a speedup of five times over the 
original neural gas for codebook design in image vector quantization. 
An exact mathematical analysis of vector quantization dynamics is presented
in [113]. In case of no suboptimal local minima of the quantization error, WTA 
always converges to the best quantization error, but the search speed is sensi- 
tive to prototype initialization. Neural gas can improve convergence speed and 
achieve robustness to initial conditions. However, depending on the structure of 
the data, neural gas does not always obtain the best asymptotic quantization 
error. 

Lack of an output space has limited the application of neural gas to data pro- 
jection and visualization. Curvilinear component analysis [33] first performs vec- 
tor quantization of the data manifold in input space using SOM, and then makes 
a nonlinear projection of the quantizing vectors by minimizing a cost function 
based on the inter-point distances. The computational complexity is O(N). The 
output is not a fixed grid but a continuous space that is able to take the shape 
of the data manifold. Online visualization neural gas [39] concurrently adjusts 
the codebook vectors in input space and the codebook positions in a continu- 
ous output space. The method has a complexity of O(N log N). It outperforms 
SOM-based and neural gas based curvilinear component analysis methods, in 
both their batch and online versions for neighborhood sizes smaller than 20 or 
30. In general, neural gas based curvilinear component analysis exhibits much 
better performance than its SOM-based counterpart. 

Single pass extensions of neural gas and SOM [1] are based on a simple patch 
decomposition of the data set and fast batch optimization schemes of the under- 
lying cost function. The algorithms require fixed memory space, and maintain the 
benefits of the original ones including easy implementation and interpretation as 
well as large flexibility and adaptability. 

Based on the cost function of neural gas, a batch variant of neural gas [29] 
shows much faster convergence and can be interpreted as optimization of the cost 
function by the Newton method. Based on the notion of the generalized median 
in analogy to median SOM, a variant for non-vectorial proximity data can be 
introduced. Convergence of the batch and median versions of neural gas, SOM and
C-means is proved in a unified formulation in [29].


Competitive Hebbian learning 

In a Voronoi tessellation, when the prototype of each Voronoi region is connected 
to all the prototypes of its bordering Voronoi regions, a Delaunay triangulation 
is obtained. Competitive Hebbian learning [91, 93] is a method that generates 
a subgraph of the Delaunay triangulation of the prototypes, called an induced 
Delaunay triangulation, by masking the Delaunay triangulation with a data dis- 
tribution P(x). Induced Delaunay triangulation has been proved to be optimally 
topology-preserving in a general sense [91]. 

Figure 8.13 Illustration of the Delaunay triangulation and an induced Delaunay triangulation.

Given a number of prototypes in $R^J$, competitive Hebbian learning successively
adds connections among them by evaluating input data drawn from a distribution
$P(\boldsymbol{x})$. The method does not change the prototypes, but only generates
topology according to them. For each input vector $\boldsymbol{x}$, the two closest prototypes
are connected by an edge. This leads to an induced Delaunay triangulation,
which is limited to those regions of the input space $R^J$, where $P(\boldsymbol{x}) > 0$. The
Delaunay triangulation and the induced Delaunay triangulation are illustrated
in Fig. 8.13. The Delaunay triangulation is represented by a mix of thick and
thick-dashed lines, the induced Delaunay triangulation by thick lines, Voronoi
tessellation by thin lines, prototypes by circles, and a data distribution $P(\boldsymbol{x})$ by
shaded regions. To generate an induced Delaunay triangulation, two prototypes
are connected only if at least a part of the common border of their Voronoi
polygons lies in a region where $P(\boldsymbol{x}) > 0$.
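
The edge-building rule of competitive Hebbian learning is simple enough to state directly in code. The following Python sketch is illustrative only: the function name `competitive_hebbian_edges` and the representation of edges as sorted index pairs are assumptions, and at least two prototypes are required.

```python
import math


def competitive_hebbian_edges(prototypes, samples):
    """Connect, for every sample, its two closest prototypes by an edge.

    The prototypes themselves are never moved; only the edge set grows.
    Returns a set of (i, j) index pairs with i < j.
    """
    edges = set()
    for x in samples:
        order = sorted(range(len(prototypes)),
                       key=lambda i: math.dist(x, prototypes[i]))
        first, second = order[0], order[1]
        edges.add((min(first, second), max(first, second)))
    return edges
```

Because an edge appears only when some sample falls near the common Voronoi border of two prototypes, regions with no data contribute no edges, which is how the data distribution masks the full Delaunay triangulation.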

The topology-representing network [93] is obtained by alternating the learning 
steps of neural gas and competitive Hebbian learning, where neural gas is used 
to distribute a certain number of prototypes and competitive Hebbian learning is 
then used to generate a topology. An edge aging scheme is used to remove obso- 
lete edges. Competitive Hebbian learning avoids the topological defects observed 
for SOM. 


ART networks 


Adaptive resonance theory (ART) [49] is biologically motivated and is a major 
development of the competitive learning paradigm. The theory leads to an evolv- 
ing series of real-time unsupervised network models for clustering, pattern recog- 
nition, and associative memory [17, 18, 19, 21]. These models are capable of 
stable category recognition in response to arbitrary input sequences with either 
fast or slow learning. ART models are characterized by systems of differential 
equations, which formulate stable self-organizing learning methods. Instar and 
outstar learning rules are the two learning rules used. ART has the ability to 
adapt, yet not forget the past training, and this is referred to as the stability- 
plasticity dilemma [49, 17]. 

At the training stage, the stored prototype of a category is adapted when an
input pattern is sufficiently similar to the prototype. When novelty is detected,
ART adaptively and autonomously creates a new category with the input pattern
as the prototype. The meaning of being sufficiently similar is dependent on a
vigilance parameter $\rho \in (0, 1)$. If $\rho$ is large, the similarity condition becomes
stringent and many finely divided categories are formed. In contrast, smaller $\rho$
gives coarser categorization, resulting in fewer categories.

Figure 8.14 Architecture of the ART model.

The stability and plasticity properties as well as the ability to efficiently pro- 
cess dynamic data make ART attractive for clustering large, rapidly changing 
sequences of input patterns, such as in the case of data mining. However, the 
ART approach does not correspond to C-means and vector quantization in a 
global optimization sense [86]. The ART model family is sensitive to the order 
of presentation of the input patterns. ART models tend to build clusters of the 
same size, independently of the distribution of the data. 


ART models 


The ART model family includes a series of unsupervised learning models. ART
networks employ a J-K recurrent architecture. The input layer F1, called a
comparing layer, has $J$ neurons, while the output layer F2, called a recognizing
layer, has $K$ neurons. Layers F1 and F2 are fully interconnected in both
directions. Layer F2 acts as a WTA network. The feedforward weights connecting
to F2 neuron $j$ are represented by $\boldsymbol{w}_j = (w_{1j}, \ldots, w_{Jj})^T$, while the feedback
weights from the same neuron are represented by $\boldsymbol{c}_j = (c_{j1}, c_{j2}, \ldots, c_{jJ})^T$. The
vector $\boldsymbol{c}_j$ stores the prototype of cluster $j$. $J$ is the number of features used to
represent a pattern and the number of clusters $K$ varies with the size of the
problem. The architecture of the ART model is shown in Fig. 8.14. The output
selects one of the $K$ prototypes, $\boldsymbol{c}_i$, by setting $y_i = 1$ and $y_j = 0$, $j \neq i$.

The ART models are characterized by a set of short-term memory and
long-term memory time-domain nonlinear differential equations. The short-term
memory equations describe the evolution of the neurons and the interactions
between them, while the long-term memory equations describe the change of the
interconnection weights with time as a function of the system state. Layer F1
stores the short-term memory for the current input pattern, while F2 stores the
prototypes of clusters as the long-term memory.

Three types of ART implementations can be distinguished, namely, full mode, 
short-term memory steady-state mode, and fast-learning mode [18, 107]. In full- 
mode implementation, both the short-term memory and long-term memory dif- 
ferential equations are realized. The short-term memory steady-state mode only 
implements the long-term memory differential equations, while the short-term 
memory behavior is governed by nonlinear algebraic equations. In fast-learning 
mode, both the short-term memory and the long-term memory are implemented 
by their steady-state nonlinear algebraic equations, and thus proper sequenc- 
ing of short-term memory and long-term memory events is required. The fast- 
learning mode is inexpensive and is most popular. 

The simplest and most popular ART model is ART 1 [17] for learning to 
categorize arbitrarily many, complex binary input patterns presented in an arbi- 
trary order. ART 2 [18] is designed to categorize analog or binary random input 
sequences. ART 2 has a more complex F1 field that allows it to stably cate- 
gorize sequences of analog inputs that can be arbitrarily close to one another. 
By characterizing the clustering behavior of ART 2, Burke [16] has found
similarity between ART-based clustering and C-means clustering. ART 2A [19]
uses only the feedforward connection between F1 and F2 in learning. An
implementation of ART 2A is given in [57]. ART-C 2A [57] applies a constraint-reset
mechanism on ART 2A to allow a direct control on the number of output
clusters generated during the self-organizing process. ART 2A and ART-C 2A 
have clustering quality comparable to that of C-means and SOM, but with an 
advantage in computation time [57]. 

The ARTMAP model family is a class of supervised learning methods. 
ARTMAP, also termed predictive ART, autonomously learns to classify arbi- 
trarily many, arbitrarily ordered vectors into recognition categories based on 
predictive success [20]. ARTMAP is self-organizing, self-stabilizing, match learn- 
ing, and real-time. It learns orders of magnitude more quickly and also is more 
accurate than BP. These are achieved by using an internal controller that jointly 
maximizes predictive generalization and minimizes predictive error by linking 
predictive success to category size on a trial-by-trial basis, using only local oper- 
ations. However, ARTMAP is very sensitive to the order of presentation of the 
training patterns. Fuzzy ARTMAP [21] is shown to be a universal approximator 
[110]. Many popular ART and ARTMAP models and algorithms are reviewed in 
[36, 37]. 


ART 1 


The main elements of the basic ART 1 model are shown in Fig. 8.15. The two fields
of neurons, F1 and F2, are linked both bottom-up and top-down by adaptive
filters. The unsupervised two-layer feedforward (bottom-up) pattern recognition
network is termed an attentional subsystem. There is also an auxiliary subsystem,
called the orienting subsystem, that becomes active during search.

Figure 8.15 Architecture of ART 1 with supplemental units. $G_1$ and $G_2$ are outputs of gain control
units. The F2 reset unit controls vigilance matching.

Since ART 1 in fast learning mode is most widely used, we only discuss this
mode here. To begin with, all cluster categories are set as uncommitted. When a
new pattern is presented at time $t$, the net input to F2 neuron $j$ is given by

$$net_j(t) = \boldsymbol{x}_t^T \boldsymbol{w}_j(t), \quad j = 1, \ldots, K, \tag{8.26}$$

where $\boldsymbol{w}_j = (w_{1j}, \ldots, w_{Jj})^T$.

Competition between F2 neurons is performed to select the winning neuron $w$
such that

$$net_w(t) = \max_{j} net_j(t). \tag{8.27}$$

Neuron $w$ then undergoes a vigilance test so as to determine whether it is close
enough to $\boldsymbol{x}_t$:

$$\frac{\|\boldsymbol{x}_t \wedge \boldsymbol{c}_w(t-1)\|}{\|\boldsymbol{x}_t\|} \geq \rho, \tag{8.28}$$

where $\wedge$ denotes the logical AND operation. For binary values of $x_i$, the
norm reduces to $\|\boldsymbol{x}\| = \sum_i x_i$, the number of unit entries.

If neuron $w$ passes the vigilance test, the system enters the resonance mode,
and the weights for the winning neuron are updated by

$$\boldsymbol{c}_w(t) = \boldsymbol{c}_w(t-1) \wedge \boldsymbol{x}_t, \tag{8.29}$$

$$\boldsymbol{w}_w(t) = \frac{L \left[\boldsymbol{c}_w(t-1) \wedge \boldsymbol{x}_t\right]}{L - 1 + \|\boldsymbol{c}_w(t-1) \wedge \boldsymbol{x}_t\|}, \tag{8.30}$$

where $L > 1$ is a constant parameter.

Otherwise, the F2 neuron reset mechanism is applied to remove neuron $w$ from
the current search by setting $net_w = -1$ and the system enters the search mode.
If none of the stored categories passes the vigilance test, one of the uncommitted
categories of the $K$ categories is assigned to this pattern.
A popular fast-learning implementation is given in [95, 107]. For initialization,
select $0 < \rho < 1$ and $L > 1$, $\boldsymbol{w}_j(0) = \frac{1}{1+J}\mathbf{1}$ and $\boldsymbol{c}_j(0) = \mathbf{1}$, $j = 1, \ldots, K$, where
$\mathbf{1}$ denotes a $J$-dimensional vector whose entries are all unity. In [107], $\boldsymbol{w}_j(0) =
\frac{L}{L-1+J}\mathbf{1}$. In the algorithm, $\rho$ determines the level of abstraction at which ART
discovers clusters. The minimal number of clusters present in the data can be
determined by $\rho_{\min} < \frac{1}{J}$. Initial bottom-up weights are usually selected as
$0 < w_{ij}(0) \leq \frac{L}{L-1+J}$. Larger values of $w_{ij}(0)$ favor the creation of new nodes, while
smaller values attempt to put a pattern into an existing cluster. The order of the
training patterns may influence the final prototypes and clusters. Unlike many
alternative methods such as SOM and the Hopfield network, ART 1 can deal
with an arbitrary combination of binary input patterns. In addition, ART 1 has
no restriction on memory capacity since its memory matrices are not square.
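
The fast-learning cycle of (8.26) to (8.30), that is competition, vigilance test, reset, and resonance, can be sketched in Python as follows. This is a minimal illustration and not the implementation of [95] or [107]: the function name `art1_fast`, the default values of `rho` and `L`, the assumed default initial bottom-up weight `w0 = 1/(1+J)`, and the handling of an all-zero pattern (left unassigned) are assumptions. Patterns are lists of 0/1 integers.

```python
def art1_fast(patterns, K, rho=0.7, L=2.0, w0=None):
    """Cluster binary patterns with ART 1 in fast-learning mode.

    Returns (assignments, c): the winning F2 index per pattern (None if no
    category could be assigned) and the top-down prototype vectors c_j.
    """
    J = len(patterns[0])
    if w0 is None:
        w0 = 1.0 / (1.0 + J)              # assumed initial bottom-up weight
    w = [[w0] * J for _ in range(K)]      # bottom-up weights w_j
    c = [[1] * J for _ in range(K)]       # top-down prototypes, c_j(0) = 1
    assignments = []
    for x in patterns:
        norm_x = sum(x)
        net = [sum(xi * wij for xi, wij in zip(x, w[j])) for j in range(K)]  # (8.26)
        winner = None
        for _ in range(K):
            j = max(range(K), key=lambda m: net[m])                          # (8.27)
            if net[j] < 0 or norm_x == 0:
                break                      # every node was reset, or x is empty
            match = [xi & ci for xi, ci in zip(x, c[j])]
            if sum(match) / norm_x >= rho:                                   # (8.28)
                winner = j
                c[j] = match                                                 # (8.29)
                denom = L - 1.0 + sum(match)
                w[j] = [L * m / denom for m in match]                        # (8.30)
                break
            net[j] = -1.0                  # reset: remove node j from this search
        assignments.append(winner)
    return assignments, c
```

An uncommitted node still holds the all-ones prototype, so it always passes the vigilance test; it is therefore selected automatically once every committed node has been reset, which is how a new category is created in this sketch.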

ART models are typically governed by differential equations, which result in a 
high computational complexity for numerical implementations. Implementations 
using analog or optical hardware are more desirable. A modified ART 1 algo- 
rithm in fast-learning mode is used for easy hardware implementation [107]. The 
method has also been extended for full mode and short-term memory steady- 
state mode. A number of hardware implementations of ART 1 in different modes 
are also surveyed in [107]. 


C-means clustering 


The most well-known data-clustering technique is the statistical C-means (also
known as k-means) algorithm [90]. The C-means algorithm approximates the
ML solution for determining the locations of the means of a mixture density of
component densities. It is closely related to simple competitive learning, and is
a special case of SOM. The algorithm partitions a set of $N$ input patterns, $\mathcal{X}$,
into $K$ separate subsets $C_k$, each containing $N_k$ input patterns, by minimizing
the MSE function

$$E(\boldsymbol{c}_1, \ldots, \boldsymbol{c}_K) = \frac{1}{N} \sum_{k=1}^{K} \sum_{\boldsymbol{x}_n \in C_k} \|\boldsymbol{x}_n - \boldsymbol{c}_k\|^2, \tag{8.31}$$

where $\boldsymbol{c}_k$ is the prototype or center of cluster $C_k$. To improve the similarity of
samples in each cluster, one can minimize $E$ with respect to $\boldsymbol{c}_k$ by setting $\frac{\partial E}{\partial \boldsymbol{c}_k} = 0$;
thus, the optimal location of $\boldsymbol{c}_k$ is the mean of the samples in the cluster

$$\boldsymbol{c}_k = \frac{1}{N_k} \sum_{\boldsymbol{x}_i \in C_k} \boldsymbol{x}_i. \tag{8.32}$$


C-means clustering can be implemented in either batch mode [85] or incre- 
mental mode [90]. Batch C-means [85], frequently called the Linde-Buzo-Gray, 
LBG or generalized Lloyd algorithm, is applied when the whole training set is 
available. When the training set is obtained online, incremental C-means is com- 
monly applied. 
Algorithm 8.5 (C-means).

1. Set $K$.
2. Arbitrarily select an initial cluster partition.
3. Repeat until the change in all $\boldsymbol{c}_k$ is sufficiently small:
   a. Decide $K$ cluster prototypes $\boldsymbol{c}_k$.
   b. Redistribute patterns among the clusters using criterion (8.31).





In batch C-means, the initial partition is arbitrarily defined by placing each
input pattern into a randomly selected cluster. The prototypes are defined to
be the average of the patterns in the individual clusters. When C-means is
performed, at each step the patterns keep changing from one cluster to the closest
cluster $\boldsymbol{c}_k$ according to the simple competitive learning rule

$$\|\boldsymbol{x}_i - \boldsymbol{c}_k\| = \min_{j} \|\boldsymbol{x}_i - \boldsymbol{c}_j\| \tag{8.33}$$

and the prototypes are then recalculated according to (8.32).

In incremental C-means, each cluster is initialized with a random pattern as its
prototype. C-means continues to update the prototypes upon the presentation
of each new pattern. If at time $t$ the $k$th prototype is $\boldsymbol{c}_k(t)$ and the input pattern
is $\boldsymbol{x}_t$, then at time $t + 1$ incremental C-means gives the new prototype as

$$\boldsymbol{c}_k(t+1) = \begin{cases} \boldsymbol{c}_k(t) + \eta(t)\left(\boldsymbol{x}_t - \boldsymbol{c}_k(t)\right), & k = \arg\min_j \|\boldsymbol{x}_t - \boldsymbol{c}_j\| \\ \boldsymbol{c}_k(t), & \text{otherwise} \end{cases}, \tag{8.34}$$

where $\eta(t)$ should slowly decrease to zero, and typically $\eta(0) < 1$.

Unlike SOM and neural gas, C-means is very sensitive to the initialization of the
prototypes, since it adapts the prototypes only locally according to their nearest
data points. Neighborhood cooperation, as used in SOM and neural gas, offers one
biologically plausible solution to this problem. The general procedure for C-means
clustering is given by Algorithm 8.5.
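
Both modes of C-means can be sketched compactly in Python. The code below is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the batch version starts from prototypes drawn from the data rather than from a random partition, an empty cluster simply keeps its previous prototype, and the learning-rate schedule of the incremental step is left to the caller.

```python
import math
import random


def batch_c_means(data, K, max_iter=100, seed=0):
    """Batch C-means: alternate assignment (8.33) and mean update (8.32)."""
    rng = random.Random(seed)
    protos = [list(x) for x in rng.sample(data, K)]  # assumed initialization
    for _ in range(max_iter):
        clusters = [[] for _ in range(K)]
        for x in data:
            k = min(range(K), key=lambda j: math.dist(x, protos[j]))
            clusters[k].append(x)
        new_protos = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else protos[k]                # keep prototype if empty
            for k, members in enumerate(clusters)
        ]
        if new_protos == protos:                     # prototypes stopped moving
            break
        protos = new_protos
    return protos


def incremental_c_means_step(protos, x, eta):
    """Incremental C-means (8.34): move only the nearest prototype toward x."""
    k = min(range(len(protos)), key=lambda j: math.dist(x, protos[j]))
    protos[k] = [c + eta * (a - c) for a, c in zip(x, protos[k])]
    return k
```

Because only the nearest prototype is ever adapted, the result depends on where the prototypes start, which is the initialization sensitivity noted above.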

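After the algorithm converges, we can calculate the variance vector, $\boldsymbol{\sigma}_k =
(\sigma_{k,1}, \ldots, \sigma_{k,J})^T$, for each cluster

$$\sigma_{k,i}^2 = \frac{1}{N_k - 1} \sum_{\boldsymbol{x}_n \in C_k} \left(x_{n,i} - c_{k,i}\right)^2, \quad i = 1, \ldots, J. \tag{8.35}$$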

The relation between PCA and C-means is established in [35]. Principal
components have been proved to be the continuous solutions to the discrete cluster
membership indicators for C-means clustering, with a clear simplex cluster
structure [35]. Lower bounds for the C-means objective function (8.31) are derived as
the total variance minus the eigenvalues of the data covariance matrix [35].


Improvements on the C-means

In [23], incremental C-means is improved by adding two mechanisms, one for 
biasing the clustering towards an optimal Voronoi partition by using a cluster 
variance-weighted MSE as the objective function and the other for adjusting the 
learning rate dynamically according to the current variances in all partitions. 
The method always converges to an optimal or near-optimal configuration.

Enhanced LBG [101] is derived directly from LBG with a negligible overhead. 
The concept of utility of a codeword is a powerful instrument to overcome the 
problem of bad local minima arising from a bad choice of the initial codebook. 
The utility allows the identification of those badly positioned codewords, and 
guides their movement from the proximity of a local minimum in the error func- 
tion. Enhanced LBG outperforms LBG with utility [44] both in terms of accuracy 
and number of required iterations. 

To deal with the initialization problem, the global C-means algorithm [83] 
obtains near-optimal solutions in terms of clustering error by employing C-means 
as a local search procedure. It is an incremental-deterministic algorithm. It incre- 
mentally solves the M-clustering problem by solving all intermediate problems 
with 1,...,M clusters using C-means. Global C-means is better than C-means 
with multiple restarts. 

In an efficient implementation of LBG [63], the data points are stored in a k-d
tree. The algorithm is typically one order of magnitude faster than LBG. A fast
C-means clustering algorithm [80] uses the cluster center displacements between 
two successive partition processes to reject unlikely candidates for a data point. 
The computing time increases linearly with the data dimension d, whereas the 
computational complexity of k-d tree based algorithms increases exponentially 
with d. 

SYNCLUS [34] is a method for variable weighting in C-means clustering. Start- 
ing from an initial set of weights, it first uses C-means to partition data into K 
clusters. It then estimates a new set of optimal weights by optimizing a weighted 
mean-squares, stress-like cost function. The two stages alternate until they con- 
verge to an optimal set of weights. W-C-means [59] can automatically weight 
variables based on the importance of the variables in clustering. It adds a step to 
C-means to update the variable weights based on the current partition of data. 
W-C-means outperforms C-means in recovering clusters in data. The computational
complexity of the algorithm is O(tmNK) for t iterations, K clusters, m
attributes, and N objects.


Example 8.6: Clustering can be used for image segmentation. We apply C-means
with K = 2, and the result is shown in Fig. 8.16. It indicates that the two birds
are clearly segmented.

Figure 8.16 Image segmentation using clustering. (a) Original image. (b) Segmented image.

Figure 8.17 Image quantization using clustering. (a) Original image. (b) Quantized image.


Example 8.7: Figure 8.17 shows the result of using C-means for image
quantization. We select K = 16. It is shown that by quantizing a grayscale image
from 256 levels to 16 levels the image quality is still acceptable. The picture
SNR for quantization is 36.8569 dB.
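The quantization experiment can be imitated with a small Python sketch that runs one-dimensional C-means on gray levels and then measures the reconstruction quality. Everything here is an assumption made for illustration: the evenly spaced initial levels, the helper names, and in particular the definition of the picture SNR as 10 log10(255^2 / MSE), which is the usual peak SNR for 8-bit images and may differ from the definition used for the figure above.

```python
import math

def quantize_levels(pixels, K, max_iter=100):
    """1-D C-means on gray levels; returns the K representative levels."""
    lo, hi = min(pixels), max(pixels)
    # Assumed deterministic start: levels spread evenly over the intensity range.
    levels = [lo + (k + 0.5) * (hi - lo) / K for k in range(K)]
    for _ in range(max_iter):
        sums, counts = [0.0] * K, [0] * K
        for p in pixels:
            k = min(range(K), key=lambda j: abs(p - levels[j]))
            sums[k] += p
            counts[k] += 1
        # Unused levels keep their previous value.
        new = [sums[k] / counts[k] if counts[k] else levels[k] for k in range(K)]
        if new == levels:
            break
        levels = new
    return levels

def quantize(pixels, levels):
    """Replace every pixel by its nearest representative level."""
    return [min(levels, key=lambda v: abs(p - v)) for p in pixels]

def psnr(original, reconstructed, peak=255.0):
    """Peak SNR in dB (assumed definition: 10 log10(peak^2 / MSE))."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

For an image, `pixels` would be the flattened list of gray values; `psnr(pixels, quantize(pixels, quantize_levels(pixels, 16)))` then gives a figure comparable in spirit to the one quoted in the example.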


Subtractive clustering 


Mountain clustering [116] is a simple and effective method for estimating the
number of clusters and the initial locations of the cluster centers, which are
the difficulties faced by most conventional methods. The method grids the data
space and treats each grid point as a potential cluster center. A potential
value is computed for each grid point based on its distances to the actual data
points, so that it measures the density of the surrounding data points. The
grid point with the highest potential is selected as the first cluster center,
and then the potential values of all the other grid points are reduced
according to their distances to the first cluster center. Grid points closer to
the first cluster center have a greater reduction in potential. The next
cluster center is located at the grid point with the highest remaining
potential. This process is repeated until the remaining potential values of all
the grid points fall below a threshold. Because the number of grid points grows
exponentially with the dimension of the data space, the method suffers from the
curse of dimensionality.

Subtractive clustering [24] is a modified form of mountain clustering. The idea 
is to use all the data points to replace all the grid points as potential cluster 
centers. By this means, the effective number of grid points is reduced to the size 
of the pattern set, which is independent of the dimensionality of the problem. 
Subtractive clustering is a fast method for estimating clusters in the data. 

Subtractive clustering assumes each of the N data points in the pattern set,
$x_i$, to be a potential cluster center, and the potential measure is defined as
a function of the Euclidean distances to all the other input data points

$$P(i) = \sum_{j=1}^{N} e^{-\alpha \|x_i - x_j\|^2}, \quad i = 1, \ldots, N, \qquad (8.36)$$

where $\alpha = 4 / r_a^2$, the constant $r_a > 0$ being effectively a normalized
radius defining the neighborhood. Data points outside this radius have
insignificant influence on the potentials. A data point surrounded by many
neighboring data points has a high potential value. Thus, the mountain and
subtractive clustering techniques are less sensitive to noise than other
clustering algorithms, such as C-means and FCM [13].

After the data point with the highest potential, $x_u$ with $u = \arg\max_i P(i)$,
is selected as the kth cluster center, that is, $c_k = x_u$ with P(k) = P(u) as
its potential value, the potential of each data point $x_i$ is revised by
subtracting a term associated with $c_k$

$$P(i) = P(i) - P(k)\, e^{-\beta \|x_i - c_k\|^2}, \qquad (8.37)$$

where $\beta = 4 / r_b^2$, and the constant $r_b > 0$ is a normalized radius
defining the neighborhood. In order to avoid closely located cluster centers,
select $r_b > r_a$, typically $r_b = 1.25 r_a$.

The algorithm continues until the remaining potential of all the data points is
below some fraction of the potential of the first cluster center, that is,

$$P(k) = \max_i P(i) < \varepsilon P(1), \qquad (8.38)$$

where $\varepsilon \in (0, 1)$. When $\varepsilon$ is close to 0, a large number
of hidden nodes will be generated. On the contrary, a value of $\varepsilon$
close to 1 will lead to a small network structure. Typically, $\varepsilon$ is
selected as 0.15.

Subtractive clustering is described by Algorithm 8.6 [24, 25]. 

Algorithm 8.6 (Subtractive clustering).

1. Set $r_a$, $r_b$ and $\varepsilon$.
2. Calculate the potential values P(i), i = 1, ..., N.
3. Set k = 1.
4. Repeat until P(k) < $\varepsilon$ P(1):
   a. Find data point $x_u$ with $u = \arg\max_i P(i)$.
   b. Set the kth cluster center as $x_u$, that is, $c_k = x_u$ and P(k) = P(u).
   c. Revise the potential of each data point $x_i$ by (8.37).
   d. Set k = k + 1.

The training data $x_i$ can be scaled before applying the method. This helps
in selecting proper values for $\alpha$ and $\beta$. Since it is difficult to
select a suitable $\varepsilon$, additional criteria for accepting/rejecting
cluster centers can be used. One method is to select two thresholds [24, 25],
namely, $\overline{\varepsilon}$ and $\underline{\varepsilon}$. Above
$\overline{\varepsilon}$, $c_k$ is definitely accepted as a cluster center,
while below $\underline{\varepsilon}$ it is definitely rejected. If P(k) falls
between the two thresholds, a tradeoff between a reasonable potential and its
distance to the existing cluster centers must be examined

$$R = \frac{d_{\min}}{r_a} + \frac{P(k)}{P(1)}, \qquad (8.39)$$

where $d_{\min}$ is the shortest of the distances between $c_k$ and $c_i$,
i = 1, ..., k - 1. If R ≥ 1, accept $c_k$ and continue the algorithm. If R < 1,
reject $c_k$ and set P(k) = P(u) = 0, and select the data point with the next
highest potential as $c_k$ and retest.
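The basic single-threshold procedure can be sketched in Python as follows. This is a hedged illustration rather than a reference implementation: it assumes Gaussian potentials with coefficients 4 / r_a^2 and 4 / r_b^2, a default ratio r_b = 1.25 r_a, and a stopping fraction `eps`; it does not include the two-threshold acceptance test. The function and parameter names are hypothetical.

```python
import math

def subtractive_clustering(X, ra=1.0, rb=None, eps=0.15):
    """Return the list of cluster centers selected from the data points X."""
    rb = 1.25 * ra if rb is None else rb   # assumed default ratio r_b / r_a
    alpha, beta = 4.0 / ra ** 2, 4.0 / rb ** 2

    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Potential of every data point: sum of Gaussian contributions.
    P = [sum(math.exp(-alpha * d2(xi, xj)) for xj in X) for xi in X]
    centers, first = [], None
    while True:
        u = max(range(len(X)), key=lambda i: P[i])
        Pk = P[u]
        if first is None:
            first = Pk                      # potential of the first center
        elif Pk < eps * first:
            break                           # remaining potentials are too small
        centers.append(list(X[u]))
        # Subtract the influence of the new center from all potentials.
        P = [p - Pk * math.exp(-beta * d2(x, X[u])) for p, x in zip(P, X)]
    return centers
```

Since the potential of a selected point drops to zero, each pass picks a new data point and the loop stops after at most N selections. The data should be scaled first so that a single radius is meaningful in every dimension.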

Unlike C-means and FCM, which require iterations of many epochs, subtrac- 
tive clustering requires only one pass of the training data. Besides, the number of 
clusters does not need to be specified a priori. Subtractive clustering is a deter- 
ministic method: For the same network structure, the same network parameters 
are always obtained. 

Both C-means and FCM require O(KNT) computations, where T is the total
number of epochs and each computation requires the calculation of the
distance and the memberships. The computational load for subtractive
clustering is $O(N^2 + KN)$, each computation involving calculation of an
exponential function. Thus, for small- or medium-size training sets,
subtractive clustering is relatively fast. However, when $N \gg KT$,
subtractive clustering requires more training time [31].

Subtractive clustering provides only rough estimates of the cluster centers,
since the cluster centers obtained are situated at some data points. Moreover,
since $\alpha$ and $\beta$ are not determined from the data set and no cluster
validity is used, the clusters produced may not appropriately represent the
underlying structure of the data. For small data sets, one can try a number of
values for $\alpha$, $\beta$ and $\varepsilon$ and select a proper network
structure. The results by subtractive clustering can be used to determine the
number of clusters and their initial values for initializing iterative
clustering algorithms such as C-means and FCM.

Subtractive clustering can be improved by performing a search over $\alpha$ and
$\varepsilon$, which makes it essentially equivalent to the least-biased fuzzy
clustering algorithm [9]. The least-biased fuzzy clustering, based on
deterministic annealing [105, 104], tries to minimize the clustering entropy of
each cluster, namely, the entropy of the centroid with respect to the clustering
membership distribution of data points, under the assumption of unbiased
centroids. Subtractive clustering can be realized by replacing the Gaussian
potential function with a Cauchy-type function of first order [3].

In [100], mountain clustering and subtractive clustering are improved by tuning 
the prototypes obtained using the gradient-descent method so as to maximize 
the potential function. By modifying the potential function, mountain clustering 
can also be used to detect other types of clusters like circular shells [100]. 


Fuzzy clustering 


Fuzzy clustering is an important class of clustering algorithms. It helps to find 
natural vague boundaries in data. We introduce some fuzzy clustering algorithms 
in this section. Preliminaries of fuzzy sets and fuzzy logic are given in Chapter 21. 


Fuzzy C-means clustering


The discreteness of each cluster endows the C-means algorithm with analytical 
and algorithmic intractabilities. Partitioning the data set in a fuzzy manner helps 
to circumvent such difficulties. FCM clustering [13], also known as the fuzzy 
ISODATA [38], considers each cluster as a fuzzy set, and each feature vector 
may be assigned to multiple clusters with some degree of certainty measured by 
the membership function taking values in [0, 1]. 

FCM optimizes the objective function

$$E = \sum_{j=1}^{K} \sum_{i=1}^{N} \mu_{ji}^m \|x_i - c_j\|^2, \qquad (8.40)$$


where N is the size of the input pattern set, $U = \{\mu_{ji}\}$ denotes the
membership matrix whose element $\mu_{ji}$ denotes the membership of $x_i$ in
cluster j and $\mu_{ji} \in [0, 1]$. The parameter $m \in (1, \infty)$ is a
weighting factor called a fuzzifier. For better interpretation, the following
condition must be satisfied:

$$\sum_{j=1}^{K} \mu_{ji} = 1, \quad i = 1, \ldots, N. \qquad (8.41)$$

Algorithm 8.7 (Fuzzy C-means).

1. Set t = 0.
2. Initialize K, $\varepsilon$ and m.
3. Randomize and normalize U(0) according to (8.41), and then calculate
   $c_j(0)$, j = 1, ..., K, by (8.43).
   Or alternatively, set $c_j(0)$, j = 1, ..., K, and then calculate U(0) by
   (8.42).
4. Repeat until e(t) < $\varepsilon$:
   a. Set t = t + 1.
   b. Calculate $\mu_{ji}(t)$ and $c_j(t)$ according to (8.42) and (8.43).
   c. Calculate e(t) by (8.44).




By minimizing (8.40) subject to (8.41), the optimal membership function
$\mu_{ji}$ and cluster centers $c_j$ are derived as

$$\mu_{ji} = \frac{\left( \frac{1}{\|x_i - c_j\|^2} \right)^{\frac{1}{m-1}}}{\sum_{l=1}^{K} \left( \frac{1}{\|x_i - c_l\|^2} \right)^{\frac{1}{m-1}}}, \quad i = 1, \ldots, N, \; j = 1, \ldots, K, \qquad (8.42)$$

$$c_j = \frac{\sum_{i=1}^{N} \mu_{ji}^m x_i}{\sum_{i=1}^{N} \mu_{ji}^m}, \quad j = 1, \ldots, K. \qquad (8.43)$$

Equation (8.42) corresponds to a soft-max rule and (8.43) is similar to the mean
of the data points in a cluster. The two equations are dependent on each other.
The iterative optimization procedure is known as alternating optimization.
The iteration process terminates when the change in the prototypes

$$e(t) = \sum_{j=1}^{K} \|c_j(t) - c_j(t-1)\|^2 \qquad (8.44)$$

is sufficiently small. FCM is summarized in Algorithm 8.7.
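The alternating optimization can be sketched in Python as below. The sketch starts from given initial prototypes (the second initialization option), computes memberships from inverse squared distances raised to the power 1/(m - 1), and recomputes prototypes as membership-weighted means. The function names, the tolerance, the iteration cap, and the rule for a pattern that coincides exactly with a prototype are assumptions of this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def memberships(x, centers, m):
    """Memberships of one pattern x in every cluster (they sum to 1)."""
    d = [sq_dist(x, c) for c in centers]
    zeros = [j for j, dj in enumerate(d) if dj == 0.0]
    if zeros:
        # x coincides with a prototype: share full membership among those.
        return [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(d))]
    w = [(1.0 / dj) ** (1.0 / (m - 1.0)) for dj in d]
    s = sum(w)
    return [wj / s for wj in w]

def fcm(X, centers, m=2.0, eps=1e-10, max_iter=300):
    """Return (centers, U) where U[i][j] is the membership of X[i] in cluster j."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        U = [memberships(x, centers, m) for x in X]
        new = []
        for j in range(len(centers)):
            w = [U[i][j] ** m for i in range(len(X))]
            s = sum(w)
            # Prototype j is the weighted mean of all patterns.
            new.append([sum(wi * x[d] for wi, x in zip(w, X)) / s
                        for d in range(len(X[0]))])
        e = sum(sq_dist(a, b) for a, b in zip(new, centers))
        centers = new
        if e < eps:   # change in the prototypes is sufficiently small
            break
    U = [memberships(x, centers, m) for x in X]
    return centers, U
```

Unlike the hard assignment in C-means, every pattern contributes to every prototype, so the prototypes of well-separated clusters are pulled slightly toward one another.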

In FCM, the fuzzifier m determines the fuzziness of the partition produced,
and reduces the influence of small membership values. If $m \to 1^+$, the
resulting partition asymptotically approaches a hard or crisp partition. On the
other hand, the partition becomes a maximally fuzzy partition if $m \to \infty$.
FCM with a high degree of fuzziness diminishes the probability of getting stuck
at local minima [13]. A typical value for m is 1.5 or 2.0. Interval type-2 FCM
accepts an interval-valued fuzzifier $[m_L, m_R]$ [62], and general type-2 FCM
[84] extends interval type-2 FCM via the $\alpha$-planes representation theorem.
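The effect of the fuzzifier is easy to see numerically. The following Python sketch evaluates the memberships of a single pattern whose squared distances to two prototypes are 1 and 4 (values chosen arbitrarily for illustration) for several values of m, using the inverse-distance membership rule of FCM. The helper name is hypothetical.

```python
def membership_from_distances(sq_dists, m):
    """FCM memberships from nonzero squared distances to the K prototypes."""
    w = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in sq_dists]
    s = sum(w)
    return [wj / s for wj in w]

# Sweep the fuzzifier from nearly crisp to very fuzzy.
for m in (1.1, 1.5, 2.0, 5.0, 100.0):
    mu = membership_from_distances([1.0, 4.0], m)
    print(f"m = {m:6.1f}  memberships = {[round(v, 4) for v in mu]}")
```

For m close to 1 almost all of the membership goes to the nearer prototype, while for very large m the two memberships approach 1/K = 0.5 regardless of the distances.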

FCM needs to store the membership matrix U and all the prototypes $c_j$. The
alternating estimation of U and the $c_j$'s causes a computational and storage
burden for large-scale data sets. Computation can be accelerated by combining
their updates [76], and consequently storage of U is avoided. The single
iteration timing of the accelerated method grows linearly with K, while that of
FCM grows quadratically with K since the norm calculation introduces another
nested summation [76].

C-means is a special case of FCM, when $\mu_{ji}$ is unity for only one class
and zero for all the other classes. Like C-means, FCM may find a local optimum
for a specified number of centers. The result is dependent on the initial
membership matrix U(0) or cluster centers $c_j(0)$, j = 1, ..., K.

Single-pass FCM and online FCM [51] facilitate scaling to very large numbers 
of examples while providing partitions that very closely approximate those one 
would obtain using FCM. 

FCM has been generalized by introducing the generalized Boltzmann distribution
to escape local minima [103]. In [121], the relation between the stability of
the fixed points of FCM, $(U^*, \bar{x})$, and the data set is given. This
relation provides a theoretical basis for selecting the weighting exponent in
FCM.

Penalized FCM [118] is a convergent generalized FCM obtained by adding a
penalty term associated with $\mu_{ji}$. A weighted FCM [109] is used for fuzzy
modeling towards developing a Takagi-Sugeno-Kang (TSK) fuzzy model of optimal
structure. All these and many other generalizations of FCM can be analyzed in
a unified framework, termed the generalized FCM [122], by using the Lagrange
multiplier method from an objective function that comprises a generalization of
the FCM criterion and a regularization term representing the constraints.

In an agglomerative FCM algorithm [82], a penalty entropy term is introduced
to the objective function of FCM to make the clustering process insensitive to
the initial cluster centers. The initial number of clusters is set to be larger
than the true number of clusters in a data set. With the entropy cost function,
each initial cluster center moves to the dense center of a cluster in the data
set. Initial cluster centers that end up at the same location are merged, and
the number of clusters is determined as the number of merged clusters in the
output of the algorithm.

$\varepsilon$-insensitive FCM ($\varepsilon$FCM) is an extension to FCM that is
obtained by introducing the robust statistics using Vapnik's
$\varepsilon$-insensitive estimator as the loss function to reduce the effect of
outliers [81]. It is based on $L_1$-norm clustering [66]. Other robust
extensions to FCM include $L_p$-norm clustering (0 < p < 1) [56] and
$L_1$-norm clustering [66].

The concept of $\alpha$-cut implementation can be used to form cluster cores
such that the data points inside the cluster core will have a membership value
of 1. FCM$\alpha$ [119] can achieve robustness for suitably large m values with
the same computational complexity as FCM. The cluster cores generated by
FCM$\alpha$ are suitable for nonspherical shape clusters. FCM$\alpha$ is
equivalent to FCM when $\alpha = 1$. When the weighting exponent m becomes
larger, FCM$\alpha$ clustering trims most noisy points.

When the data set is a blend of unlabeled and labeled patterns, FCM with
partial supervision [102] can be applied. The classification information is
added to the objective function used in FCM, and FCM with partial supervision is
derived following the same procedure as that of FCM. Conditional FCM and
deterministic annealing clustering [104] consider various contributions of
different samples and take account of sample weighting. Locality-weighted
C-means and locality-weighted FCM are two locality-sensitive algorithms [60],
where the neighborhood structure information between objects is transformed
into weights of objects. The weight between a point and a center is in the form
of a Gaussian function. In addition, two semi-supervised extensions of
locality-weighted FCM are proposed to better use some given partial supervision
information in data objects.


Other fuzzy clustering algorithms 


There are numerous other clustering algorithms based on the concept of fuzzy 
membership. Two early fuzzy clustering algorithms are Gustafson-Kessel clus- 
tering [50] and adaptive fuzzy clustering [2]. Gustafson-Kessel clustering extends 
FCM by using the Mahalanobis distance, and is suited for hyperellipsoidal
clusters of equal volume. This algorithm typically takes five times as long as
FCM to complete cluster formation [64]. Adaptive fuzzy clustering [2] also employs the
Mahalanobis distance, and is suitable for ellipsoidal or linear clusters. Gath-Geva 
clustering [46] is derived from a combination of FCM and fuzzy ML estimation. 
The method incorporates the hypervolume and density criteria as cluster-validity 
measures and performs well in situations of large variability of cluster shapes, 
densities, and number of data points in each cluster. 

C-means and FCM are based on the minimization of the trace of the (fuzzy) 
within-cluster scatter matrix. The minimum scatter volume and minimum cluster 
volume algorithms are two iterative clustering algorithms based on determinant 
(volume) criteria [77]. The minimum scatter volume algorithm minimizes the 
determinant of the sum of the scatter matrices of the clusters, while the minimum 
cluster volume algorithm minimizes the sum of the volumes of the individual 
clusters. The behavior of the minimum scatter volume algorithm is similar to that 
of C-means, whereas the minimum cluster volume algorithm is more versatile. 
The minimum cluster volume algorithm in general gives better results than the 
C-means, minimum scatter volume, and Gustafson-Kessel algorithms do, and is 
less sensitive to initialization than the EM algorithm. 

A cluster represented by a volume prototype implies that all the data points 
close to a cluster center belong fully to that cluster. In [65], Gustafson-Kessel 
clustering and FCM have been extended by using the volume prototypes and 
similarity-driven merging of clusters. 

Soft-competitive learning in clustering algorithms has the same function as
fuzzy clustering [6]. The soft-competition scheme [117] is another soft version
of LVQ. The soft-competition scheme asymptotically evolves into the Kohonen
learning algorithm. It is a sequential, deterministic vector quantization
algorithm, which is realized by modifying the neighborhood mechanism of the
Kohonen learning algorithm and incorporating the stochastic relaxation
principles. The soft-competition scheme consistently provides better codebooks
than incremental C-means, even for the same computation time. The learning rates
of the soft-competition scheme are partially based on posterior probabilities.

Generalized LVQ [99] introduces soft competition into LVQ by updating every
prototype for each input vector. If there is a perfect match between the incoming 
input and the winner node, then generalized LVQ reduces to LVQ. On the other 
hand, the greater the mismatch to the winner, the larger the impact of an input 
vector on the update of the nonwinner nodes. Generalized LVQ is very sensi- 
tive to simple scaling of the input data, since its learning rates are reciprocally 
dependent on the sum of the squares of the distances from an input vector to 
the node weight vectors. 

FCM-DFCV [32] is a method based on an adaptive quadratic distance for each 
class defined by a diagonal matrix, and is a special case of Gustafson-Kessel clus- 
tering [50] based on a quadratic adaptive distance of each cluster defined by a 
fuzzy covariance matrix. The methods based on adaptive distances outperform 
FCM. There are also many fuzzy ART and ARTMAP models, and fuzzy clustering
algorithms based on the Kohonen network and LVQ, ART networks, or
the Hopfield network. These algorithms are reviewed in [36], [37].

Fuzzy C-regressions [55] embeds FCM into switching regression. The method
depends heavily on the initial values. Mountain C-regressions [114] solves
the initial-value problem. Using a modified mountain clustering to extract C 
cluster centers in the transformed data space, which correspond to C regression 
models in the original data set, mountain C-regressions can form well-estimated 
C regression models for switching regression data sets. According to the proper- 
ties of transformation, mountain C-regressions is also robust to noise and outliers. 


Example 8.8: In this example, we illustrate three popular clustering algorithms:
C-means, FCM, and subtractive clustering. The input data represents three
clusters centered at (2, 2), (-2, 2), and (-1, -3), each having 200 data points
with a Gaussian distribution N(0, 1) in both x and y directions. The initial
cluster centers for C-means are randomly sampled from the data set. The
fuzzifier of FCM is selected as m = 2. For C-means and FCM, the termination
criterion is that the error in the objective function for two adjacent
iterations is less than $10^{-5}$ and the maximum number of epochs is 100. We
specify the number of clusters as 3. For subtractive clustering, the parameter
$\varepsilon$ is selected as 0.4.

Simulations are performed based on averaging of 1000 random runs. The
simulation results are listed in Table 8.1. As far as this artificial data set
is concerned, C-means and FCM have almost the same performance, which is
considerably superior to that of subtractive clustering. However, subtractive
clustering is a deterministic method, and it can automatically detect a suitable
number of clusters for a wide range of $\varepsilon$. The clustering results for
a random run are illustrated in Fig. 8.18. There are minor differences in the
cluster boundaries and the cluster centers between the algorithms.

Table 8.1. Comparison of C-means, FCM and subtractive clustering for an artificial
data set. $d_{wc}$ stands for the mean within-cluster distance, and $d_{bc}$ for
the mean between-cluster distance. A smaller value of $d_{wc}/d_{bc}$
corresponds to better performance.

Figure 8.18 Clustering using three methods: C-means, FCM, and subtractive clustering. The stars
denote the cluster centers.
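To reproduce an experiment of this kind, one needs the synthetic data and a way to score a clustering. The Python sketch below generates three Gaussian clusters with the stated centers and unit standard deviation, and computes a ratio of mean within-cluster distance to mean between-cluster distance. The exact definitions used for the table are not given in the text, so the ones here (mean distance of each pattern to its assigned center, and mean pairwise distance between centers) are assumptions, as are the function names and the seed.

```python
import math
import random

def make_data(n_per_cluster=200, seed=0):
    """Three 2-D Gaussian clusters with unit standard deviation per axis."""
    rng = random.Random(seed)
    true_centers = [(2.0, 2.0), (-2.0, 2.0), (-1.0, -3.0)]
    return [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)]
            for cx, cy in true_centers for _ in range(n_per_cluster)]

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def within_between_ratio(X, labels, centers):
    """Assumed score: mean within-cluster distance over mean between-cluster distance."""
    d_wc = sum(dist(x, centers[lab]) for x, lab in zip(X, labels)) / len(X)
    pairs = [(i, j) for i in range(len(centers))
             for j in range(i + 1, len(centers))]
    d_bc = sum(dist(centers[i], centers[j]) for i, j in pairs) / len(pairs)
    return d_wc / d_bc
```

Any clustering routine that returns labels and centers can be scored with `within_between_ratio`; averaging the score over many seeds mirrors the averaging over random runs described above.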


Problems 


8.1 A two-dimensional grid of neurons is trained with three-dimensional input
data randomly distributed in a volume defined by $0 < x_1, x_2, x_3 < 1$. Use SOM
to find the network topology of 8 x 8 neurons after 10, 1,000, and 10,000
iterations.


8.2 Failure of topological ordering may occur during SOM iterations.
This may be caused by a rapid decay of the neighborhood size of the winning
node. Verify this statement by using computer simulation.


8.3 For SOM, if a node fails, what is its effect: 
(a) during the learning process; (b) after learning is completed? 


8.4 Define an energy function to be minimized for clustering or vector quanti- 
zation problems. Is it possible to solve it using the Hopfield network? 


8.5 Write a MATLAB program to implement the k-NN algorithm. Classify the 
Iris problem using the program. 


8.6 Both SOM and neural gas are well-known topology-preserving clustering
algorithms. Implement neural gas on the data set for Problem 8.1. Try to draw
conclusions from the simulation results.


8.7 Clustering is used to partition image pixels into K clusters. Although the 
intensity-based FCM algorithm functions well on segmenting most noise-free 
images, it fails to segment images corrupted by noise, outliers, and other imaging 
artifacts. Show this on a magnetic resonance imaging (MRI) image by computer
experiments. 


8.8 Show that the computational complexity of C-means is O(NKdT).


8.9 Consider the quantization of the Lena image of 512 x 512 pixels with 256 
gray levels. LBG is applied to the quantization of the 4 x 4 subimages extracted 
from the original picture.

(a) Calculate the MSE when 32 codewords are used. 

(b) Draw the original and reconstructed images. 

(c) Calculate the picture SNR. 


References 


[1] N. Alex, A. Hasenfuss & B. Hammer, Patch clustering for massive data sets. Neurocomput., 72 (2009), 1455-1469.
[2] I.A. Anderson, J.C. Bezdek & R. Dave, Polygonal shape description of plane boundaries. In: L. Troncale, Ed., Systems Science and Science (Louisville, KY: SGSR, 1982), 1, 295-301.
[3] P.P. Angelov & D.P. Filev, An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Trans. Syst. Man Cybern. B, 34:1 (2004), 484-498.
[4] T. Aoki & T. Aoyagi, Self-organizing maps with asymmetric neighborhood function. Neural Comput., 19 (2007), 2515-2535.
[5] A.F. Atiya, Estimating the posterior probabilities using the K-nearest neighbor rule. Neural Comput., 17 (2005), 731-740.










[6] A. Baraldi & P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition - Part II. IEEE Trans. Syst. Man Cybern. B, 29:6 (1999), 786-801.
[7] E. Bax, Validation of k-nearest neighbor classifiers. IEEE Trans. Inf. Theory, 58:5 (2012), 3225-3234.
[8] G. Beliakov & G. Li, Improving the speed and stability of the k-nearest neighbors method. Pattern Recogn. Lett., 33 (2012), 1296-1301.
[9] G. Beni & X. Liu, A least biased fuzzy clustering method. IEEE Trans. Pattern Anal. Mach. Intell., 16:9 (1994), 954-960.
[10] E. Berglund & J. Sitte, The parameterless self-organizing map algorithm. IEEE Trans. Neural Netw., 17:2 (2006), 305-316.
[11] S. Bermejo, The regularized LVQ1 algorithm. Neurocomput., 70 (2006), 475-488.
[12] J. Bezdek, Cluster validity with fuzzy sets. J. Cybern., 3:3 (1974), 58-71.
[13] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (New York: Plenum Press, 1981).
[14] M. Biehl, A. Ghosh & B. Hammer, Dynamics and generalization ability of LVQ algorithms. J. Mach. Learn. Res., 8 (2007), 323-360.
[15] A. Brodal, Neurological Anatomy in Relation to Clinical Medicine, 3rd Edn. (New York: Oxford University Press, 1981).
[16] L.I. Burke, Clustering characterization of adaptive resonance. Neural Netw., 4:4 (1991), 485-491.
[17] G.A. Carpenter & S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, Image Process., 37 (1987), 54-115.
[18] G.A. Carpenter & S. Grossberg, ART 2: Self-organization of stable category recognition codes for analog input patterns. Appl. Optics, 26 (1987), 4919-4930.
[19] G. Carpenter, S. Grossberg & D.B. Rosen, ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition. Neural Netw., 4 (1991), 493-504.
[20] G.A. Carpenter, S. Grossberg & J.H. Reynolds, ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Netw., 4:5 (1991), 565-588.
[21] G.A. Carpenter, S. Grossberg, N. Markuzon, J.H. Reynolds & D.B. Rosen, Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans. Neural Netw., 3 (1992), 698-713.
[22] Y. Cheng, Convergence and ordering of Kohonen’s batch map. Neural Comput., 9 (1997), 1667-1676.
[23] C. Chinrunrueng & C.H. Sequin, Optimal adaptive k-means algorithm with dynamic adjustment of learning rate. IEEE Trans. Neural Netw., 6:1 (1995), 157-169.



















[24] S. Chiu, Fuzzy model identification based on cluster estimation. J. Intell. & Fuzzy Syst., 2:3 (1994), 267-278.
[25] S.L. Chiu, A cluster estimation method with extension to fuzzy model identification. In: Proc. IEEE Int. Conf. Fuzzy Syst., Orlando, FL, 2 (1994), 1240-1245.
[26] J. Cho, A.R.C. Paiva, S.-P. Kim, J.C. Sanchez & J.C. Principe, Self-organizing maps with dynamic learning for signal reconstruction. Neural Netw., 20 (2007), 274-284.
[27] C.S.T. Choy & W.C. Siu, A class of competitive learning models which avoids neuron underutilization problem. IEEE Trans. Neural Netw., 9:6 (1998), 1258-1269.
[28] C.S.T. Choy & W.C. Siu, Fast sequential implementation of “neural-gas” network for vector quantization. IEEE Trans. Commun., 46:3 (1998), 301-304.
[29] M. Cottrell, B. Hammer, A. Hasenfuss & T. Villmann, Batch and median neural gas. Neural Netw., 19 (2006), 762-771.
[30] T.M. Cover & P.E. Hart, Nearest neighbor pattern classification. IEEE Trans. Inf. Theory, 13 (1967), 21-27.
[31] R.N. Dave & R. Krishnapuram, Robust clustering methods: a unified view. IEEE Trans. Fuzzy Syst., 5:2 (1997), 270-293.
[32] F.A.T. de Carvalho, C.P. Tenorio & N.L. Cavalcanti, Jr., Partitional fuzzy clustering methods based on adaptive quadratic distances. Fuzzy Sets Syst., 157 (2006), 2833-2857.
[33] P. Demartines & J. Herault, Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. Neural Netw., 8:1 (1997), 148-154.
[34] W.S. Desarbo, J.D. Carroll, L.A. Clark & P.E. Green, Synthesized clustering: A method for amalgamating clustering bases with differential weighting variables. Psychometrika, 49 (1984), 57-78.
[35] C. Ding & X. He, Cluster structure of k-means clustering via principal component analysis. In: Proc. 8th Pacific-Asia Conf. on Advances in Knowl. Discov. Data Min. (PAKDD), Sydney, Australia, 2004, LNCS 3056 (Berlin: Springer, 2004), 414-418.
[36] K.L. Du & M.N.S. Swamy, Neural Networks in a Softcomputing Framework (London: Springer, 2006).
[37] K.-L. Du, Clustering: A neural network approach. Neural Netw., 23:1 (2010), 89-107.
[38] J.C. Dunn, Some recent investigations of a new fuzzy partitioning algorithm and its application to pattern classification problems. J. Cybern., 4 (1974), 1-15.
[39] P.A. Estevez & C.J. Figueroa, Online data visualization using the neural gas network. Neural Netw., 19 (2006), 923-934.
[40] H.A. Fayed & A.F. Atiya, A novel template reduction approach for the K-nearest neighbor method. IEEE Trans. Neural Netw., 20:5 (2009), 890-896.




[41] J.A. Flanagan, Self-organization in Kohonen’s SOM. Neural Netw., 9:7
(1996), 1185-1197.

[42] J.C. Fort, Solving a combinatorial problem via self-organizing process: An
application of Kohonen-type neural networks to the travelling salesman
problem. Biol. Cybern., 59 (1988), 33-40.

[43] B. Fritzke, A self-organizing network that can follow nonstationary
distributions. In: W. Gerstner, A. Germond, M. Hasler & J.D. Nicoud, Eds.,
Proc. Int. Conf. Artif. Neural Netw., Lausanne, Switzerland, LNCS 1327 (Berlin:
Springer, 1997), 613-618.

[44] B. Fritzke, The LBG-U method for vector quantization: An improvement
over LBG inspired from neural networks. Neural Process. Lett., 5:1 (1997),
35-45.

[45] K. Fukushima, Neocognitron trained with winner-kill-loser rule. Neural
Netw., 23 (2010), 926-938.

[46] I. Gath & A.B. Geva, Unsupervised optimal fuzzy clustering. IEEE Trans.
Pattern Anal. Mach. Intell., 11:7 (1989), 773-781.

[47] A. Gersho, Asymptotically optimal block quantization. IEEE Trans. Inf.
Theory, 25:4 (1979), 373-380.

[48] A. Ghosh, M. Biehl & B. Hammer, Performance analysis of LVQ algorithms:
a statistical physics approach. Neural Netw., 19 (2006), 817-829.

[49] S. Grossberg, Adaptive pattern classification and universal recoding: I.
Parallel development and coding of neural feature detectors; II. Feedback,
expectation, olfaction, and illusions. Biol. Cybern., 23 (1976), 121-134 &
187-202.

[50] D.E. Gustafson & W. Kessel, Fuzzy clustering with a fuzzy covariance
matrix. In: Proc. IEEE Conf. Decision Contr., San Diego, CA, 1979, 761-766.

[51] L.O. Hall & D.B. Goldgof, Convergence of the single-pass and online fuzzy
C-means algorithms. IEEE Trans. Fuzzy Syst., 19:4 (2011), 792-794.

[52] B. Hammer, M. Strickert & T. Villmann, Supervised neural gas with general
similarity measure. Neural Process. Lett., 21:1 (2005), 21-44.

[53] T. Hara & A. Hirose, Plastic mine detecting radar system using
complex-valued self-organizing map that deals with multiple-frequency
interferometric images. Neural Netw., 17 (2004), 1201-1210.

[54] P.E. Hart, The condensed nearest neighbor rule. IEEE Trans. Inf. Theory,
14:3 (1968), 515-516.

[55] R.J. Hathaway & J.C. Bezdek, Switching regression models and fuzzy
clustering. IEEE Trans. Fuzzy Syst., 1 (1993), 195-204.

[56] R.J. Hathaway & J.C. Bezdek, Generalized fuzzy c-means clustering
strategies using Lp norm distances. IEEE Trans. Fuzzy Syst., 8:5 (2000),
576-582.

[57] J. He, A.H. Tan & C.L. Tan, Modified ART 2A growing network capable of
generating a fixed number of nodes. IEEE Trans. Neural Netw., 15:3 (2004),
728-737.

[58] L. Holmstrom, P. Koistinen, J. Laaksonen & E. Oja, Neural and statistical
classifiers: taxonomy and two case studies. IEEE Trans. Neural Netw., 8:1
(1997), 5-17.

[59] J.Z. Huang, M.K. Ng, H. Rong & Z. Li, Automated variable weighting in
k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell., 27:5 (2005),
657-668.

[60] P. Huang & D. Zhang, Locality sensitive C-means clustering algorithms.
Neurocomput., 73 (2010), 2935-2943.

[61] G.J. Hueter, Solution of the traveling salesman problem with an adaptive
ring. In: Proc. IEEE Int. Conf. Neural Netw., San Diego, 1988, 85-92.

[62] C. Hwang & F. Rhee, Uncertain fuzzy clustering: interval type-2 fuzzy
approach to C-means. IEEE Trans. Fuzzy Syst., 15:1 (2007), 107-120.

[63] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman &
A.Y. Wu, An efficient k-means clustering algorithm: Analysis and
implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24:7 (2002), 881-892.

[64] N.B. Karayiannis & M.M. Randolph-Gips, Soft learning vector quantization
and clustering algorithms based on non-Euclidean norms: multinorm
algorithms. IEEE Trans. Neural Netw., 14:1 (2003), 89-102.

[65] U. Kaymak & M. Setnes, Fuzzy clustering with volume prototypes and
adaptive cluster merging. IEEE Trans. Fuzzy Syst., 10:6 (2002), 705-712.

[66] P.R. Kersten, Fuzzy order statistics and their application to fuzzy
clustering. IEEE Trans. Fuzzy Syst., 7:6 (1999), 708-712.








[67] T. Kohonen, Self-organized formation of topologically correct feature
maps. Biol. Cybern., 43 (1982), 59-69.

[68] T. Kohonen, Self-organization and associative memory (Berlin: Springer,
1989).

[69] T. Kohonen, The self-organizing map. Proc. IEEE, 78 (1990), 1464-1480.

[70] T. Kohonen, Improved versions of learning vector quantization. In: Proc.
IJCNN, 1990, 1, 545-550.

[71] T. Kohonen, Derivation of a class of training algorithms. IEEE Trans.
Neural Netw., 1 (1990), 229-232.

[72] T. Kohonen, Emergence of invariant-feature detectors in the
adaptive-subspace self-organizing map. Biol. Cybern., 75 (1996), 281-291.

[73] T. Kohonen, Self-Organizing Maps, 3rd Edn. (Berlin: Springer, 2001).

[74] T. Kohonen, J. Kangas, J. Laaksonen & K. Torkkola, LVQPAK: A program
package for the correct application of learning vector quantization
algorithms. In: Proc. ISCNN, Baltimore, MD, 1992, 1, 725-730.

[75] T. Kohonen, Self-organizing neural projections. Neural Netw., 19 (2006),
723-733.

[76] J. Kolen & T. Hutcheson, Reducing the time complexity of the fuzzy
C-means algorithm. IEEE Trans. Fuzzy Syst., 10:2 (2002), 263-267.

[77] R. Krishnapuram & J. Kim, Clustering algorithms based on volume criteria.
IEEE Trans. Fuzzy Syst., 8:2 (2000), 228-236.

[78] H. Kusumoto & Y. Takefuji, O(log M) self-organizing map algorithm
without learning of neighborhood vectors. IEEE Trans. Neural Netw., 17:6
(2006), 1656-1661.

[79] J. Laaksonen & E. Oja, Classification with learning k-nearest neighbors.
In: Proc. Int. Conf. Neural Networks, Washington DC, USA, Jun. 1996,
1480-1483.

[80] J.Z.C. Lai, T.-J. Huang & Y.-C. Liaw, A fast k-means clustering algorithm
using cluster center displacement. Pattern Recogn., 42 (2009), 2551-2556.

[81] J. Leski, Towards a robust fuzzy clustering. Fuzzy Sets Syst., 137 (2003),
215-233.

[82] M.J. Li, M.K. Ng, Y.-m. Cheung & J.Z. Huang, Agglomerative fuzzy
K-means clustering algorithm with selection of number of clusters. IEEE Trans.
Knowl. Data Eng., 20:11 (2008), 1519-1534.

[83] A. Likas, N. Vlassis & J.J. Verbeek, The global k-means clustering
algorithm. Pattern Recogn., 36:2 (2003), 451-461.

[84] O. Linda & M. Manic, General type-2 fuzzy C-means algorithm for uncertain
fuzzy clustering. IEEE Trans. Fuzzy Syst., 20:5 (2012), 883-897.

[85] Y. Linde, A. Buzo & R.M. Gray, An algorithm for vector quantizer design.
IEEE Trans. Commun., 28 (1980), 84-95.

[86] R.P. Lippmann, An introduction to computing with neural nets. IEEE ASSP
Mag., 4:2 (1987), 4-22.

[87] Z.P. Lo & B. Bavarian, On the rate of convergence in topology preserving
neural networks. Biol. Cybern., 65 (1991), 55-63.

[88] S.P. Luttrell, Derivation of a class of training algorithms. IEEE Trans.
Neural Netw., 1 (1990), 229-232.

[89] S.P. Luttrell, A Bayesian analysis of self-organizing maps. Neural Comput.,
6 (1994), 767-794.

[90] J.B. MacQueen, Some methods for classification and analysis of
multivariate observations. In: Proc. 5th Berkeley Symp. Math. Stat. and
Probab., University of California Press, Berkeley, 1967, 281-297.

[91] T.M. Martinetz, Competitive Hebbian learning rule forms perfectly topology
preserving maps. In: Proc. Int. Conf. Artif. Neural Netw. (ICANN), Amsterdam,
1993, 427-434.

[92] T.M. Martinetz, S.G. Berkovich & K.J. Schulten, Neural-gas network for
vector quantization and its application to time-series predictions. IEEE Trans.
Neural Netw., 4:4 (1993), 558-569.

[93] T.M. Martinetz & K.J. Schulten, Topology representing networks. Neural
Netw., 7 (1994), 507-522.

[94] G. McLachlan & K. Basford, Mixture Models: Inference and Application to
Clustering (New York: Marcel Dekker, 1988).

[95] B. Moore, ART and pattern clustering. In: D. Touretzky, G. Hinton & T.
Sejnowski, Eds., Proc. 1988 Connectionist Model Summer School (San Mateo,
CA: Morgan Kaufmann, 1988), 174-183.

[96] S.A. Mulder & D.C. Wunsch II, Million city traveling salesman problem
solution by divide and conquer clustering with adaptive resonance neural
networks. Neural Netw., 16 (2003), 827-832.













[97] K. Obermayer, H. Ritter & K. Schulten, Development and spatial structure
of cortical feature maps: a model study. In: R.P. Lippmann, J.E. Moody &
D.S. Touretzky, Eds., Advances in Neural Information Processing Systems 3
(San Mateo, CA: Morgan Kaufmann, 1991), 11-17.

[98] R. Odorico, Learning vector quantization with training count (LVQTC).
Neural Netw., 10:6 (1997), 1083-1088.

[99] N.R. Pal, J.C. Bezdek & E.C.K. Tsao, Generalized clustering networks and
Kohonen’s self-organizing scheme. IEEE Trans. Neural Netw., 4:2 (1993),
549-557.

[100] N.R. Pal & D. Chakraborty, Mountain and subtractive clustering method:
Improvements and generalizations. Int. J. Intell. Syst., 15 (2000), 329-341.

[101] G. Patane & M. Russo, The enhanced LBG algorithm. Neural Netw., 14:9
(2001), 1219-1237.

[102] W. Pedrycz & J. Waletzky, Fuzzy clustering with partial supervision. IEEE
Trans. Syst. Man Cybern. B, 27:5 (1997), 787-795.

[103] J. Richardt, F. Karl & C. Muller, Connections between fuzzy theory,
simulated annealing, and convex duality. Fuzzy Sets Syst., 96 (1998), 307-334.

[104] K. Rose, Deterministic annealing for clustering, compression,
classification, regression, and related optimization problems. Proc. IEEE,
86:11 (1998), 2210-2239.

[105] K. Rose, E. Gurewitz & G.C. Fox, A deterministic annealing approach to
clustering. Pattern Recogn. Lett., 11:9 (1990), 589-594.

[106] A. Sato & K. Yamada, Generalized learning vector quantization. In: G.
Tesauro, D. Touretzky & T. Leen, Eds., Advances in Neural Information
Processing Systems 7 (Cambridge, MA: MIT Press, 1995), 423-429.

[107] T. Serrano-Gotarredona & B. Linares-Barranco, A modified ART 1
algorithm more suitable for VLSI implementations. Neural Netw., 9:6 (1996),
1025-1043.

[108] S. Seo & K. Obermayer, Soft learning vector quantization. Neural Comput.,
15 (2003), 1589-1604.

[109] G. Tsekouras, H. Sarimveis, E. Kavakli & G. Bafas, A hierarchical
fuzzy-clustering approach to fuzzy modeling. Fuzzy Sets Syst., 150:2 (2004),
245-266.

[110] S.J. Verzi, G.L. Heileman, M. Georgiopoulos & G.C. Anagnostopoulos,
Universal approximation with fuzzy ART and fuzzy ARTMAP. In: Proc.
IJCNN, Portland, Oregon, USA, 2003, 3, 1987-1892.

[111] C. von der Malsburg, Self-organization of orientation sensitive cells in
the striate cortex. Kybernetik, 14 (1973), 85-100.

[112] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited
data. IEEE Trans. Syst. Man Cybern., 2:3 (1972), 408-420.

[113] A. Witoelar, M. Biehl, A. Ghosh & B. Hammer, Learning dynamics and
robustness of vector quantization and neural gas. Neurocomput., 71 (2008),
1210-1219.














[114] K.-L. Wu, M.-S. Yang & J.-N. Hsieh, Mountain c-regressions method.
Pattern Recogn., 43 (2010), 86-98.

[115] Y. Wu, K. Ianakiev & V. Govindaraju, Improved k-nearest neighbor
classification. Pattern Recogn., 35 (2002), 2311-2318.

[116] R.R. Yager & D. Filev, Approximate clustering via the mountain method.
IEEE Trans. Syst. Man Cybern., 24:8 (1994), 1279-1284.

[117] E. Yair, K. Zeger & A. Gersho, Competitive learning and soft competition
for vector quantizer design. IEEE Trans. Signal Process., 40:2 (1992), 294-309.

[118] M.S. Yang, On a class of fuzzy classification maximum likelihood
procedures. Fuzzy Sets Syst., 57 (1993), 365-375.

[119] M.-S. Yang, K.-L. Wu, J.-N. Hsieh & J. Yu, Alpha-cut implemented fuzzy
clustering algorithms and switching regressions. IEEE Trans. Syst. Man
Cybern. B, 38:3 (2008), 588-603.

[120] H. Yin & N.M. Allinson, On the distribution and convergence of the
feature space in self-organizing maps. Neural Comput., 7:6 (1995), 1178-1187.

[121] J. Yu, Q. Cheng & H. Huang, Analysis of the weighting exponent in the
FCM. IEEE Trans. Syst. Man Cybern. B, 34:1 (2004), 634-639.

[122] J. Yu & M.S. Yang, Optimality test for generalized FCM and its
application to parameter selection. IEEE Trans. Fuzzy Syst., 13:1 (2005),
164-176.

[123] H. Zheng, G. Lefebvre & C. Laurent, Fast-learning adaptive-subspace
self-organizing map: An application to saliency-based invariant image feature
construction. IEEE Trans. Neural Netw., 19:5 (2008), 746-757.




9 Clustering II: topics in clustering


9.1 The underutilization problem 


Conventional competitive-learning-based clustering algorithms such as C-means
and LVQ are plagued by a severe initialization problem [109, 58]. If the initial
values of the prototypes are not in the convex hull formed by the input data,
clustering may not produce meaningful results. This is the so-called prototype
underutilization or dead-unit problem: some prototypes, called dead units, may
never win the competition. The underutilization problem arises because the
algorithm updates only the winning prototype for every input.

The underutilization problem is illustrated in Fig. 9.1. There are three clusters
in the data set. If the three prototypes $\mathbf{c}_1$, $\mathbf{c}_2$ and
$\mathbf{c}_3$ are initialized at A, B and C, respectively, they will correctly
move to the centers of the three clusters. However, if they are initialized at
A, B and C', respectively, the prototype at C' will never become a winner and
thus becomes a dead unit. In the latter case, the system divides the three data
clusters into two clusters, and the prototypes $\mathbf{c}_1$ and $\mathbf{c}_2$
will, respectively, move to the centroids of the two clusters.
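
The dead-unit effect described above can be reproduced with a small experiment. The following Python sketch is an illustration only: the synthetic data set, the initial positions, the learning rate and all function names are assumptions made here, not taken from the text. It runs plain winner-takes-all competitive learning twice, once with all three prototypes started near the data and once with the third prototype started far outside the convex hull of the data.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def competitive_learning(data, prototypes, eta=0.1, epochs=20, seed=0):
    """Winner-takes-all learning: only the nearest prototype is updated."""
    rng = random.Random(seed)
    protos = [list(p) for p in prototypes]
    wins = [0] * len(protos)
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)
        for x in samples:
            w = min(range(len(protos)), key=lambda i: dist2(x, protos[i]))
            wins[w] += 1
            protos[w] = [c + eta * (xi - c) for c, xi in zip(protos[w], x)]
    return protos, wins

def make_clusters(seed=1):
    """Three small square clusters (hypothetical data for illustration)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
    return [(cx + rng.uniform(-0.3, 0.3), cy + rng.uniform(-0.3, 0.3))
            for cx, cy in centers for _ in range(50)]

data = make_clusters()
good_init = [(0.5, 0.5), (3.5, 0.5), (2.0, 2.5)]    # all inside the data region
bad_init = [(0.5, 0.5), (3.5, 0.5), (50.0, 50.0)]   # third prototype far away

protos_good, wins_good = competitive_learning(data, good_init)
protos_bad, wins_bad = competitive_learning(data, bad_init)
print("wins with good initialization:", wins_good)
print("wins with bad initialization: ", wins_bad)
```

In the second run the far-away prototype never wins and therefore never moves, so the three clusters are shared between the two remaining prototypes.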

















Figure 9.1 Illustration of the underutilization problem for competitive learning based clustering. 


To alleviate the sensitivity of competitive learning to the initialization of
the clustering centers, many efforts have been made to solve the
underutilization problem. Initializing the prototypes with random input vectors
can reduce the probability of the underutilization problem, but does not
eliminate it. In the leaky learning strategy [109, 58], all the prototypes are
updated. The winning prototype is updated with a high learning rate, while all
the losing prototypes move towards the input vector with a much smaller
learning rate.


9.1.1 Competitive learning with conscience


To avoid the underutilization problem, one can assign each processing unit a
threshold, and then increase the threshold if the unit wins, or decrease it
otherwise [109]. A similar idea is embodied in the conscience strategy, which
reduces the winning rate of the frequent winners [38]. A frequent winner
receives a bad conscience through a penalty term added to its distance from the
input signal. This leads to an entropy maximization, that is, each unit wins
with approximately equal probability. Thus, the probability of underutilized
neurons being selected as winners is increased.
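
As a rough illustration of the conscience idea, the sketch below adds a penalty to the squared distance of each unit that grows with the unit's running win frequency. The specific additive penalty `gamma * (p_i - 1/K)`, the exponential averaging of the win frequency, and the names `conscience_winner` and `update_win_freq` are assumptions chosen for this example; the text does not specify the form of the penalty term.

```python
def conscience_winner(x, prototypes, win_freq, gamma=10.0):
    """Select the winner by squared distance plus a win-frequency penalty."""
    k = len(prototypes)

    def penalized(i):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, prototypes[i]))
        # Units that win more often than the fair share 1/k are handicapped.
        return d2 + gamma * (win_freq[i] - 1.0 / k)

    return min(range(k), key=penalized)

def update_win_freq(win_freq, winner, rate=0.05):
    """Exponentially averaged estimate of how often each unit wins."""
    return [p + rate * ((1.0 if i == winner else 0.0) - p)
            for i, p in enumerate(win_freq)]

# Two fixed prototypes and one repeated input that is closer to the first one.
prototypes = [(0.0, 0.0), (3.0, 0.0)]
x = (1.0, 0.0)
win_freq = [0.5, 0.5]
winners = []
for _ in range(20):
    w = conscience_winner(x, prototypes, win_freq)
    winners.append(w)
    win_freq = update_win_freq(win_freq, w)
print(winners)
```

The nearer unit wins at first, but as its win frequency rises the penalty eventually lets the other unit win as well, which is the behavior the conscience strategy aims for.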

Frequency-sensitive competitive learning (FSCL) [6] employs this conscience
strategy. It reduces the underutilization problem by introducing a distortion
measure that ensures that all codewords in the codebook are updated with a
similar probability. The codebooks obtained by FSCL have sufficient entropy so
that Huffman coding of the vector quantization indices would not provide
significant additional compression.

In FSCL, each prototype incorporates a count of the number of times it has been
the winner, $u_i$, $i = 1, \ldots, K$. The distance measure is modified to give
prototypes with a lower count value a chance to win a competition. The algorithm
is similar to the vector quantization algorithm, with the only difference being
that the winning neuron is found by [6]

$$\mathbf{c}_w(t) = \arg\min_{\mathbf{c}_i} \left\{ u_i(t-1) \left\| \mathbf{x}_t - \mathbf{c}_i(t-1) \right\| \right\}, \tag{9.1}$$

where $w$ is the index of the winning node, and $u_i$ is updated by

$$u_i(t) = \begin{cases} u_i(t-1) + 1, & i = w \\ u_i(t-1), & \text{otherwise} \end{cases} \tag{9.2}$$

and $u_i(0) = 0$, $i = 1, \ldots, K$. In (9.1), $u_i \|\mathbf{x}_t - \mathbf{c}_i\|$
can be generalized as $F(u_i) \|\mathbf{x}_t - \mathbf{c}_i\|$. When the fairness
function is selected as $F(u_i) = u_i^{\beta_0 e^{-t/T_0}}$ with constants
$\beta_0$ and $T_0$, FSCL emphasizes the winning uniformity of codewords
initially and gradually turns into competitive learning as training proceeds to
minimize the MSE function.
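
A minimal sketch of one FSCL iteration, following (9.1) and (9.2), is given below. The winner update $\mathbf{c}_w \leftarrow \mathbf{c}_w + \eta (\mathbf{x}_t - \mathbf{c}_w)$ is assumed here to be the usual vector quantization rule; the learning rate, the synthetic data, and the names `fscl_step` and `annealed_fairness` are likewise assumptions made for illustration.

```python
import math
import random

def fscl_step(x, prototypes, counts, eta=0.1, fairness=None):
    """One FSCL update: winner by count-weighted distance (9.1), count by (9.2)."""
    f = fairness if fairness is not None else (lambda u: u)
    scores = [f(counts[i]) * math.dist(x, prototypes[i])
              for i in range(len(prototypes))]
    w = scores.index(min(scores))          # ties go to the lowest index
    prototypes[w] = [c + eta * (xi - c) for c, xi in zip(prototypes[w], x)]
    counts[w] += 1
    return w

def annealed_fairness(t, beta0=1.0, T0=100.0):
    """Fairness function F(u) = u ** (beta0 * exp(-t / T0)) at time t."""
    exponent = beta0 * math.exp(-t / T0)
    return lambda u: u ** exponent

rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(300)]
prototypes = [[0.2, 0.2], [0.8, 0.8], [50.0, 50.0]]   # third one starts far away
counts = [0, 0, 0]                                     # u_i(0) = 0
for x in data:
    fscl_step(x, prototypes, counts)
print(counts, prototypes[2])
```

Because every count starts at zero, each prototype, including the one initialized far from the data, wins at least once during the first few inputs and is pulled towards the data, in contrast to plain winner-takes-all learning. Passing `fairness=annealed_fairness(t)` at step `t` gives the generalized rule with $F(u_i)$.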

In multiplicatively biased competitive learning [29], the competition among the
neurons is biased by a multiplicative term. The method avoids neuron
underutilization with probability one as time goes to infinity. Only one weight
vector is updated per learning step. FSCL is a member of this family. Fuzzy FSCL
[30] combines the frequency sensitivity with fuzzy competitive learning. Since
both FSCL and fuzzy FSCL use a non-Euclidean distance to determine the winner,
they may lead to the problem of shared clusters, in which several prototypes
are drawn into the same cluster during the learning process. SOM may yield grid
units that are never active, and a number of topographic map formation models
[11] add a conscience to the weight update process so that every grid unit is
used equally.

The habituation mechanism consists of a reversible decrement of the neural
response to a repetitive stimulus. The response recovers only after a period in
which there is no activity, and the longer the presynaptic neuron is active, the
slower it recovers. In [96], habituation has been used to build a growing neural
network: new units are added to the network when the same neural unit responds
to many input patterns. In [106], habituation is implemented in an SOM network
and its effects on learning speed and vector quantization are analyzed. The
conscience learning mechanism follows roughly the same principle but is less
sophisticated. The habituation mechanism proposed in [106] is simple to
implement and more flexible than conscience learning because it can be used to
manage the learning process in a fine-grained way, also allowing multilayer
self-organizing structures to be built. Moreover, while conscience learning
modifies the comparison between the input pattern and the neuron weights,
habituation only affects the activation of the neuron, and it adds other
variables that allow the learning process to be fine-tuned.


9.1.2 Rival penalized competitive learning


The problem of shared clusters for FSCL and fuzzy FSCL has been considered in
the rival penalized competitive learning (RPCL) algorithm [132]. RPCL adds a
new mechanism to FSCL by creating a rival penalizing force. For each input, not
only is the winning unit modified to adapt to the input, but the second-place
winner, called the rival, is also updated with a smaller learning rate in the
opposite direction, all the other prototypes being unchanged:

$$\mathbf{c}_i(t+1) = \begin{cases} \mathbf{c}_i(t) + \eta_w \left( \mathbf{x}_t - \mathbf{c}_i(t) \right), & i = w \\ \mathbf{c}_i(t) - \eta_r \left( \mathbf{x}_t - \mathbf{c}_i(t) \right), & i = r \\ \mathbf{c}_i(t), & \text{otherwise} \end{cases} \tag{9.3}$$

where $w$ and $r$ are the indices of the winning and rival prototypes, which are
decided by (9.1), and $\eta_w$ and $\eta_r$ are their respective learning rates.
In practice, $\eta_w(t) \gg \eta_r(t)$. $\eta_r$ is also called the delearning
rate for the rival. The principle of RPCL is shown in Fig. 9.2.
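
The update (9.3) can be sketched in a few lines of Python. The numerical values of the learning rates, the initial counts and the function name `rpcl_step` are assumptions made for this illustration; the winner and the rival are ranked by the count-weighted distance of (9.1).

```python
import math

def rpcl_step(x, prototypes, counts, eta_w=0.05, eta_r=0.002):
    """One RPCL update following (9.3): attract the winner, repel the rival."""
    k = len(prototypes)
    # Rank units by the count-weighted distance of (9.1); the best is the
    # winner w and the second best is the rival r.
    order = sorted(range(k), key=lambda i: counts[i] * math.dist(x, prototypes[i]))
    w, r = order[0], order[1]
    prototypes[w] = [c + eta_w * (xi - c) for c, xi in zip(prototypes[w], x)]
    prototypes[r] = [c - eta_r * (xi - c) for c, xi in zip(prototypes[r], x)]
    counts[w] += 1
    return w, r

prototypes = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
counts = [1, 1, 1]          # equal counts, so the ranking here is by distance
x = (0.5, 0.0)
before = math.dist(x, prototypes[1])
w, r = rpcl_step(x, prototypes, counts)
after = math.dist(x, prototypes[1])
print(w, r, prototypes)
```

In this single step the nearest prototype moves a little towards the input, the second nearest is pushed slightly away from it, and the third is left unchanged.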

This actually pushes the rival away from the sample pattern so as to prevent
its interference in the competition. RPCL automatically allocates an appropriate
number of prototypes for an input data set, and all the extra candidate
prototypes will finally be pushed to infinity. It provides a better performance
than FSCL. RPCL can be regarded as an unsupervised extension of supervised LVQ2,
which simultaneously modifies the weight vectors of both the winner and its
rival when the winner is in a wrong class but the rival is in a correct class
for an input vector [132]. Lotto-type competitive learning [90] can be treated
as a generalization of RPCL, where instead of just penalizing the nearest rival,
all the losers are penalized equally. Generalized lotto-type competitive
learning [91] modifies lotto-type competitive learning by allowing more than one
winner; the winners are divided into tiers, with each tier being rewarded
differently. However, RPCL may encounter overpenalization and underpenalization
problems due to an inappropriate delearning rate [145]. Generalized RPCL [133]
distributes learning and penalization to groups of agents in proportion to their
activation strengths, that is, RPCL is extended to a soft-competitive mechanism
comprising multiple winners and rivals.

Figure 9.2 Illustration of RPCL. The rival $\mathbf{c}_3$ of both $\mathbf{c}_1$ and $\mathbf{c}_2$ is driven out to infinity along a zigzag path.

Stepwise automatic rival-penalized (STAR) C-means [25] consists of two separate
phases. The first phase implements FSCL, which assigns at least one prototype
to each cluster. The second phase is derived from a Kullback-Leibler divergence
based criterion, and adjusts the units adaptively by a learning rule that
automatically penalizes the winning chance of all rival prototypes in
subsequent competitions while tuning the winning one to adapt to the input.
STAR C-means has a mechanism similar to RPCL, but penalizes the rivals in an
implicit way, thereby circumventing the determination of the rival delearning
rate of RPCL. STAR C-means is applicable to ellipse-shaped data clusters as well
as sphere-shaped ones without the underutilization problem and without having to
predetermine the correct cluster number.

The convergence problem of RPCL is investigated via a cost-function approach
in [92]. A general form of RPCL, called distance-sensitive RPCL, is proved to
be associated with the minimization of a cost function on the weight vectors of
a competitive learning network. As a distance-sensitive RPCL process decreases
the cost to a local minimum, a number of weight vectors eventually fall into a
hypersphere surrounding the sample data, while the other weight vectors diverge
to infinity. If the cost reduces to the global minimum, a correct number of
weight vectors is automatically selected and located around the centers of the
actual clusters.

The performance of RPCL is very sensitive to the selection of the delearning
rate. Rival penalized controlled competitive learning (RPCCL) [26] generalizes
RPCL; it dynamically controls the rival-penalizing forces. RPCCL always sets
the delearning rate equal to the learning rate. A rival seed point is always
penalized with the delearning rule. The RPCCL mechanism fully penalizes the
rival if the winner suffers from severe competition from the rival; otherwise,
the penalization strength is proportional to the level of competition. RPCCL
can be implemented in a stochastic version, called stochastic RPCL. Stochastic
RPCL penalizes the rivals by using the same rule as RPCL, but the penalization
is performed stochastically.

Competitive repetition-suppression (CoRe) learning [7] is inspired by a
cortical memory mechanism called repetition suppression. CoRe learning produces
sharp feature detectors and compact neural representations of the input stimuli
through a mechanism of neural suppression and strengthening that is dependent
on the frequency of the stimuli. CoRe clustering can automatically estimate
the unknown cluster number from the data without a priori information on the
input distribution. It is a robust extension of RPCL that combines the
biological inspiration with the robustness properties of the M-estimators. CoRe
clustering generalizes RPCL by allowing the winner and rival competitors to be
sets of units (instead of single neurons), and by implementing a
soft-competitive rival penalization scheme. Each CoRe neuron acts as a cluster
detector, and the repetition-suppression mechanism is used to selectively
suppress irrelevant neurons, consequently determining the unknown cluster
number. The repetition-suppression mechanism offers a means for adaptively
controlling rival penalization.

Rival-model penalized SOM [27], for each input, adaptively chooses several
rivals of the best-matching unit and pushes their associated prototypes
slightly away from the input. Each node on the grid is associated with a
prototype. Rival-model penalized SOM utilizes a constant learning rate, but
still reaches a robust result. The rival is not selected as the second nearest
neuron, because the one-neighborhood neurons of the best-matching unit are
usually topologically close to the best-matching unit, and penalizing the
second nearest neuron violates the intrinsic characteristic of SOM.


9.1.3 Softcompetitive learning


By relaxing the WTA criterion, clustering methods can treat more than a single
neuron as winners to a certain degree and update their prototypes accordingly,
resulting in the winner-takes-most paradigm, namely, softcompetitive learning.
Examples are the soft competition scheme [134], SOM, neural gas, maximum-entropy
clustering [108], generalized LVQ, FCM and fuzzy competitive learning [30]. The
winner-takes-most criterion, however, draws some prototypes away from their
corresponding clusters, since all the prototypes are attracted to each input
pattern; consequently, the solution becomes biased toward the global mean of the
clusters [89].

Fuzzy competitive learning [30] is a class of sequential algorithms obtained by 
fuzzifying competitive learning algorithms, such as simple competitive learning, 
unsupervised LVQ, and FSCL. The concept of winning is formulated as a fuzzy 
set and the network outputs become the winning memberships of the competing 
neurons. Enhanced sequential fuzzy clustering [141] is a modification to fuzzy 
competitive learning, obtained by introducing a nonunity weighting on the win- 
ning centroid and an excitation-inhibition mechanism so as to better overcome 
the underutilization problem. 

SOM is a learning process that takes the winner-takes-most strategy at the
early stages and becomes a WTA approach as its neighborhood size reduces to
unity as a function of time in a predetermined manner. Owing to the
softcompetitive strategy, SOM, growing neural gas (GNG) [51] and fuzzy
clustering algorithms are less likely than hard competitive alternatives to be
trapped at local minima and to generate dead units [10].

Maximum-entropy clustering [108] avoids the underutilization problem and
local minima in the error function by using softcompetitive learning and
deterministic annealing. The prototypes are updated by

$$\mathbf{c}_i(t+1) = \mathbf{c}_i(t) + \eta(t) \left[ \frac{e^{-\beta \|\mathbf{x}_t - \mathbf{c}_i(t)\|^2}}{\sum_{j=1}^{K} e^{-\beta \|\mathbf{x}_t - \mathbf{c}_j(t)\|^2}} \right] \left( \mathbf{x}_t - \mathbf{c}_i(t) \right), \quad i = 1, \ldots, K, \tag{9.4}$$

where $\eta$ is the learning rate, the temperature $1/\beta$ anneals from a
large value to zero, and the term within the brackets is the Boltzmann
distribution. The soft competition scheme [134] employs a similar
softcompetitive strategy, but with $\beta = 1$.
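
The following sketch evaluates the Boltzmann weights of (9.4) and applies one update to every prototype. The function name `max_entropy_step`, the numerical values and the shift of the exponents by the smallest squared distance (which leaves the ratio in (9.4) unchanged and only guards against numerical underflow of the whole sum) are assumptions made for this illustration.

```python
import math

def max_entropy_step(x, prototypes, beta, eta=0.1):
    """One soft update of all prototypes following (9.4)."""
    d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in prototypes]
    d2_min = min(d2)
    # Boltzmann weights; subtracting d2_min cancels in the normalized ratio.
    weights = [math.exp(-beta * (d - d2_min)) for d in d2]
    total = sum(weights)
    probs = [wgt / total for wgt in weights]
    updated = [[ci + eta * p * (xi - ci) for xi, ci in zip(x, c)]
               for c, p in zip(prototypes, probs)]
    return updated, probs

prototypes = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
x = (0.2, 0.0)

# High temperature (beta = 0): every prototype is attracted equally.
_, probs_hot = max_entropy_step(x, prototypes, beta=0.0)
# Low temperature (large beta): the update approaches winner-takes-all.
_, probs_cold = max_entropy_step(x, prototypes, beta=100.0)
print(probs_hot, probs_cold)
```

At a high temperature all prototypes share the update, which is what prevents dead units, while at a low temperature almost all of the weight goes to the nearest prototype.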


9.2 Robust clustering


Data sets usually contain noisy points or outliers. The robust statistics
approach has been integrated into robust clustering methods [19, 79, 35, 49,
65]. C-median clustering [19] is derived by solving a bilinear programming
problem that utilizes the $L_1$-norm distance. Fuzzy C-median clustering [79] is
a robust FCM method that uses the $L_1$-norm with the exemplar estimation based
on the fuzzy median. An approach to fuzzy clustering proposed in [23] combines
the benefits of local dimensionality reduction within each cluster using factor
analysis and the robustness to outliers of Student-t distributions.

The clustering of a vectorial data set with missing entries belongs to the
category of robust clustering. In [66], four strategies, namely, the whole data,
partial distance, optimal completion and nearest prototype strategies, have been
discussed for implementing FCM clustering for incomplete data.

Robust competitive agglomeration [49] combines the advantages of hierarchical
and partitional clustering techniques. The objective function also contains
a constraint term given in the form of (9.11). An optimum number of clusters
is determined via a process of competitive agglomeration, while the knowledge
of the global shape of the clusters is incorporated by using prototypes. Robust
statistics such as the M-estimator are incorporated to handle outliers.
Overlapping clusters are handled by the use of fuzzy memberships.

Robust clustering algorithms can be derived by optimizing a specially designed
objective function, which usually has two terms:

$$E_T = E + E_c, \tag{9.5}$$

where $E$ is the cost for the conventional algorithms such as (8.40), and the
constraint term $E_c$ characterizes the outliers.

The noise-clustering algorithm can be treated as a robustified FCM. In the
noise-clustering approach [33], noise and outliers are collected into a
separate, amorphous noise cluster, whose prototype has the same distance,
$\delta$, from all the data points. The other points are collected into $K$
clusters. The threshold $\delta$ is a relatively high value compared to the
distances of the good points to the cluster prototypes. If a noisy point is far
away from all the $K$ clusters, it is attracted to the noise cluster. When all
the $K$ clusters have about the same size, noise clustering is very effective.
However, a single threshold is too restrictive if the cluster size varies
widely in the data set.

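In noise clustering, the second term is given by

$$E_c = \sum_{i=1}^{N} \delta^2 \left( 1 - \sum_{j=1}^{K} \mu_{ji} \right)^m. \tag{9.6}$$

Following the procedure for the derivation of FCM, we have

$$\mu_{ji} = \frac{\left( \frac{1}{\|\mathbf{x}_i - \mathbf{c}_j\|} \right)^{\frac{2}{m-1}}}{\sum_{k=1}^{K} \left( \frac{1}{\|\mathbf{x}_i - \mathbf{c}_k\|} \right)^{\frac{2}{m-1}} + \left( \frac{1}{\delta} \right)^{\frac{2}{m-1}}}. \tag{9.7}$$

The second term in the denominator, which is due to outliers, leads to small
$\mu_{ji}$. The formula for the prototypes is the same as that for FCM, given by
(8.43).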
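
The effect of the noise term in (9.7) can be checked numerically. In the sketch below, the function name `noise_clustering_memberships`, the two prototypes, the value of $\delta$ and the small floor on the distance (which avoids a division by zero when a point coincides with a prototype) are assumptions made for illustration. The membership in the noise cluster is obtained as the remainder $1 - \sum_j \mu_{ji}$.

```python
import math

def noise_clustering_memberships(x, prototypes, delta, m=2.0):
    """Memberships of x in the K good clusters by (9.7), plus the noise share."""
    p = 2.0 / (m - 1.0)
    inv = [(1.0 / max(math.dist(x, c), 1e-12)) ** p for c in prototypes]
    noise = (1.0 / delta) ** p
    denom = sum(inv) + noise
    return [v / denom for v in inv], noise / denom

prototypes = [(0.0, 0.0), (4.0, 0.0)]
delta = 1.0

inlier_mu, inlier_noise = noise_clustering_memberships((0.5, 0.0), prototypes, delta)
outlier_mu, outlier_noise = noise_clustering_memberships((2.0, 50.0), prototypes, delta)
print(inlier_mu, inlier_noise)
print(outlier_mu, outlier_noise)
```

A point close to a prototype keeps most of its membership in that cluster, whereas a point far from all prototypes is assigned almost entirely to the noise cluster and therefore has little influence on the prototype update.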
In a fuzzified PCA-guided robust C-means method [71], a robust C-means 
partition is derived by using a noise-rejection mechanism based on the noise- 
clustering approach. The responsibility weight of each sample for the C-means 
process is estimated by considering the noise degree of the sample, and cluster 
indicators are calculated in a fuzzy PCA guided manner, where fuzzy PCA guided 
robust C-means is performed by considering responsibility weights of samples. 
Relational data can be clustered by using non-Euclidean relational FCM [64].
A number of fuzzy clustering algorithms for relational data have been reviewed
in [36]. The introduction of the concept of noise clustering into these
relational clustering techniques leads to their robust versions [36]. Weighted
non-Euclidean relational FCM [67] reduces the original data set to a smaller
one, assigns each selected datum a weight reflecting the number of nearby data,
clusters the weighted reduced data set using a weighted version of the feature
or relational data FCM, and, if desired, extends the reduced data results back
to the original data set.

The three important issues associated with competitive learning clustering are
initialization, adaptation to clusters of different size and sparsity, and
elimination of the disturbance caused by outliers. Energy-based competitive
learning [129] simultaneously tackles these problems. Initialization is achieved
by extracting samples of high energy to form a core point set, whereby connected
components are obtained as initial clusters. To adapt to clusters of different
size and sparsity, a size-sparsity balance of clusters is developed to select a
winning prototype, and a prototype energy weighted squared distance objective
function is defined. To eliminate the disturbance caused by outliers, an
adaptive learning rate based on the samples’ energy is proposed to update the
winner.


9.2.1 Possibilistic C-means


Possibilistic C-means (PCM) [83], as opposed to FCM, does not require that the 
sum of the memberships of a data point across the clusters be unity. This allows 
the membership functions to represent a possibility of belonging rather than a 
relative degree of membership between clusters. As a result, the derived degree 
of membership does not decrease as the number of clusters increases. Due to the 
elimination of this constraint, the modified objective function is decomposed into 
many individual objective functions, one for each cluster, which can be optimized 
separately. 

The constraint term for PCM is given by a sum associated with the fuzzy
complements of all the K clusters

$$E_c = \sum_{j=1}^{K} \beta_j \sum_{i=1}^{N} \left(1 - \mu_{ji}\right)^m, \tag{9.8}$$

where $\beta_j$ are suitable positive numbers. The individual objective functions are
given as

$$E_T^j = \sum_{i=1}^{N} \mu_{ji}^m \left\| x_i - c_j \right\|^2 + \beta_j \sum_{i=1}^{N} \left(1 - \mu_{ji}\right)^m, \quad j = 1, \ldots, K. \tag{9.9}$$

Differentiating (9.9) with respect to $\mu_{ji}$ and setting it to zero leads to the solution

$$\mu_{ji} = \frac{1}{1 + \left( \frac{\left\| x_i - c_j \right\|^2}{\beta_j} \right)^{\frac{1}{m-1}}}, \tag{9.10}$$

where the second term in the denominator is large for outliers, leading to small
$\mu_{ji}$. Some heuristics for selecting $\beta_j$ have also been given in [83].
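
As a hedged illustration of (9.10), the membership of a single point can be computed as follows. This is a minimal sketch rather than the implementation of [83]; the function name `pcm_membership` and the sample values are assumptions made for the example, and $\beta_j$ is simply passed in as a fixed number.

```python
def pcm_membership(x, c, beta, m=2.0):
    """Possibilistic membership of point x in the cluster with prototype c, Eq. (9.10)."""
    # Squared Euclidean distance between the point and the prototype.
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    # The ratio d2 / beta grows for points far from the prototype (outliers),
    # which drives the membership toward zero.
    return 1.0 / (1.0 + (d2 / beta) ** (1.0 / (m - 1.0)))
```

With prototype (0, 0), $\beta_j = 1$ and m = 2, the points (0, 0), (1, 0) and (10, 0) receive memberships 1, 0.5 and 1/101, so a distant point has little influence on the prototype. Because no sum-to-one constraint is imposed, the memberships of one point in several clusters are computed independently of each other.
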

Given a number of clusters K, FCM will arbitrarily split or merge real clusters
in the data set to produce exactly the specified number of clusters. PCM, in 
contrast to FCM, can find those natural clusters in the data set. When K is 
smaller than the number of actual clusters, only K good clusters are found, and 
the other data points are treated as outliers. When K is larger than the number 
of actual clusters, all the actual clusters can be found and some clusters will 
coincide. Thus K can be specified somewhat arbitrarily. 

In the noise-clustering algorithm, there is only one noise cluster, while in PCM 
there are K noise clusters. PCM functions as a collection of K independent 
noise-clustering algorithms, each looking for a single cluster. The performance 
of PCM, however, relies heavily on good initialization of cluster prototypes and
estimation of $\beta_j$, and PCM tends to converge to coincident clusters [83, 35].
There have been many efforts to improve the stability of possibilistic clustering 
[142]. Possibilistic FCM [102] provides a hybrid model of FCM and PCM; it 
performs effectively for low-dimensional data clustering. 

Most fuzzy clustering methods can process only spatial data, not nonspatial
data. In [124], similarity-based PCM is proposed to cluster nonspatial
data without requiring users to specify the number of clusters. It extends PCM
for similarity-based clustering applications by integration with the mountain 
method. Rough-fuzzy possibilistic C-means [93] comprises a judicious integration 
of the principles of rough and fuzzy sets. It incorporates both probabilistic and 
possibilistic memberships simultaneously to avoid the noise sensitivity of FCM 
and the coincident clusters of PCM. 


A unified framework for robust clustering 


By extending the idea of treating outliers as the fuzzy complement, a family 
of robust clustering algorithms has been obtained [136]. Assume that a noise 
cluster exists outside each data cluster. The fuzzy complement of $\mu_{ji}$, denoted as
$f(\mu_{ji})$, may be interpreted as the degree to which $x_i$ does not belong to the jth
data cluster. Thus, the fuzzy complement can be viewed as the membership of
$x_i$ in the noise cluster with a distance $\beta_j$. Based on this, one can propose many
different implementations of the probabilistic approach [35, 136]. For robust fuzzy
clustering, a general form of $E_c$ is given as a generalization of that for PCM [136]

$$E_c = \sum_{i=1}^{N} \sum_{j=1}^{K} \beta_j \left[ f(\mu_{ji}) \right]^m. \tag{9.11}$$

Notice that PCM uses the standard fuzzy complement $f(\mu_{ji}) = 1 - \mu_{ji}$.

By setting to zero the derivatives of $E_T$ with respect to the variables, a fuzzy
clustering algorithm is obtained. For example, by setting m = 1 and
$f(\mu_{ji}) = \mu_{ji} \ln \mu_{ji} - \mu_{ji}$ [35] or $f(\mu_{ji}) = 1 + \mu_{ji} \ln \mu_{ji} - \mu_{ji}$ [136], we can obtain

$$\mu_{ji} = e^{-\frac{\left\| x_i - c_j \right\|^2}{\beta_j}}, \tag{9.12}$$
and $c_j$ has the same form as that for FCM. The alternating cluster estimation
method [110] is a simple extension of the general method given in [35] and [136].
$\beta_j$ can be adjusted by [110]

$$\beta_j = \min_{k \neq j} \left\| c_j - c_k \right\|. \tag{9.13}$$


The fuzzy robust C-spherical shells algorithm [136] searches for clusters
that lie on spherical shells by combining the concept of fuzzy complement
with the fuzzy C-spherical shells algorithm [82]. The hard robust clustering algo-
rithm [136] is obtained by setting $\beta_j = \infty$ if neuron j is the winner and setting
$\beta_j$ by (9.13) if it is a loser. In these robust algorithms, the initial values and
adjustment of $\beta_j$ are very important.
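
The exponential membership (9.12) and the prototype-distance rule (9.13) can be sketched as below. This is an illustrative sketch only: the function names are assumptions, the rule for $\beta_j$ is coded as the plain distance to the nearest other prototype, and no claim is made that this matches the exact procedures of [110] or [136].

```python
import math


def robust_membership(x, c, beta):
    """Membership of Eq. (9.12): exp(-||x - c||^2 / beta)."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-d2 / beta)


def nearest_prototype_distance(centers, j):
    """Rule of Eq. (9.13): distance from prototype j to its nearest other prototype."""
    return min(math.dist(centers[j], ck) for k, ck in enumerate(centers) if k != j)
```

For prototypes at (0, 0), (3, 4) and (10, 0), the rule gives 5 for the first prototype, so a prototype that sits close to a neighbor receives a small $\beta_j$ and its membership function decays quickly.
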


Supervised clustering 


Conventional clustering methods are unsupervised clustering, where unlabeled 
patterns are involved. When output patterns are used in clustering, this yields 
supervised clustering methods. The locations of the cluster centers are influenced 
not only by the input pattern spread, but also by the output pattern deviations. 

For classification problems, the class membership of each pattern in the training
set is available and can be used for clustering. Supervised clustering develops
clusters preserving the homogeneity of the clustered patterns with regard to
their similarity in the input space, as well as their respective values assumed
in the output space. Examples of supervised clustering methods include the LVQ
family, the ARTMAP family and conditional FCM [103]. For classification problems,
supervised clustering significantly improves the decision accuracy.

In the case of supervised learning using kernel-based neural networks such as
the RBF network, the structure (kernels) is usually determined by using unsupervised
clustering. This method, however, is not effective for finding a parsimonious
network. Supervised clustering can be implemented by augmenting
the input pattern with its output pattern, $\tilde{x}_i = \left[ x_i^T, \beta y_i^T \right]^T$, so as to obtain an
improved distribution of the cluster centers by unsupervised clustering [103, 110].
A scaling factor $\beta$ balances between the underlying similarity in the input space
and the similarity in the output space. The resulting objective function in the
case of FCM is given by

$$E = \sum_{j=1}^{K} \sum_{i=1}^{N} \mu_{ji}^m \left\| x_i - c_{x,j} \right\|^2 + \sum_{j=1}^{K} \sum_{i=1}^{N} \mu_{ji}^m \left\| \beta y_i - c_{y,j} \right\|^2, \tag{9.14}$$

where the new cluster center $c_j = \left[ c_{x,j}^T, c_{y,j}^T \right]^T$. The first term corresponds to
FCM, and the second term applies to supervised learning. The resulting cluster
codebook vectors are rescaled and projected onto the input space to obtain the
centers.
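
The augmentation idea can be illustrated with a few helper functions. This is a sketch under stated assumptions: the names `augment`, `project_center` and `sq_dist` are hypothetical, and any unsupervised algorithm (C-means, FCM) would be run on the augmented patterns in between.

```python
def augment(x, y, beta):
    """Return the augmented pattern [x, beta * y] used for supervised clustering."""
    return list(x) + [beta * yi for yi in y]


def project_center(c_aug, n_inputs, beta):
    """Split an augmented center into its input part and its rescaled output part."""
    c_x = c_aug[:n_inputs]
    c_y = [v / beta for v in c_aug[n_inputs:]]  # undo the output scaling
    return c_x, c_y


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
```

The squared distance in the augmented space is the sum of an input-space term and a scaled output-space term, which is why clustering the augmented patterns corresponds to an objective with two terms such as (9.14). A small $\beta$ makes the result approach ordinary unsupervised clustering of the inputs.
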

Conditional FCM [103] is based on FCM, but requires the output variable
of a cluster to satisfy a particular condition. This condition can be treated as
a fuzzy set, defined via the corresponding membership. A family of generalized
weighted conditional FCM clustering algorithms has been derived in [87]. A semi-
supervised enhancement of FCM is given in [17], where a kernel-based distance
is applied.

Based on enhanced LBG, clustering for function approximation [57] is specially 
designed for function approximation problems. The method increases the density 
of the prototypes in the input areas where the target function presents a more 
variable response, rather than just in the zones with more input examples [57]. 
In [117], a prototype regression function is built as a linear combination of local 
linear regression models, one for each cluster, and is then inserted into the FCM 
functional. In this way, the prototypes can be adjusted according to both the 
input distribution and the regression function in the output space. 


Clustering using non-Euclidean distance measures 


Conventional clustering methods are based on the Euclidean distance, which 
favors hyperspherically shaped clusters of equal size. The Euclidean distance 
measure results in the undesirable property of splitting big and elongated clus- 
ters. Other distance measures can be defined to search for clusters of specific 
shapes in the feature space. 

The Mahalanobis distance can be used to look for hyperellipsoid-shaped clus- 
ters. It is used in Gustafson-Kessel clustering and adaptive fuzzy clustering. 
However, C-means with the Mahalanobis distance tends to produce unusually 
large or unusually small clusters. The hyperellipsoidal clustering network [95] 
integrates PCA and clustering into one network, and can adaptively estimate 
the hyperellipsoidal shape of each cluster. Hyperellipsoidal clustering implements
clustering using a regularized Mahalanobis distance, which is a linear combina-
tion of the Mahalanobis distance and the Euclidean distance, to avoid
producing unusually large or unusually small clusters.

Symmetry-based C-means [119] can effectively find clusters with symmetric 
shapes, such as the human face. The method employs the point-symmetry dis- 
tance as the dissimilarity measure. The point-symmetry distance is defined by 


$$d_{j,i} = d(x_j, c_i) = \min_{p = 1, \ldots, N,\; p \neq j} \frac{\left\| (x_j - c_i) + (x_p - c_i) \right\|}{\left\| x_j - c_i \right\| + \left\| x_p - c_i \right\|}, \tag{9.15}$$

where $c_i$ is a prototype vector, and the pattern set $\{x_i\}$ is of size N. Notice that
$d_{j,i} = 0$ only when $x_p = 2c_i - x_j$. Symmetry-based C-means uses C-means as
a coarse search for the K cluster centroids. A fine-tuning procedure is then 
performed based on the point-symmetry distance using the nearest-neighbor 
paradigm. 
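
A direct, brute-force evaluation of the point-symmetry distance (9.15) might look as follows. It is a sketch rather than the procedure of [119]; the function name and the decision to skip pairs whose denominator is zero are assumptions.

```python
import math


def point_symmetry_distance(j, points, c):
    """Point-symmetry distance of Eq. (9.15) between points[j] and prototype c."""
    xj = points[j]
    best = math.inf
    for p, xp in enumerate(points):
        if p == j:
            continue
        # Numerator: norm of (x_j - c) + (x_p - c); it vanishes when x_p mirrors x_j about c.
        num = math.hypot(*[(a - ci) + (b - ci) for a, b, ci in zip(xj, xp, c)])
        den = math.dist(xj, c) + math.dist(xp, c)
        if den > 0.0:  # skip the degenerate case where both points coincide with c
            best = min(best, num / den)
    return best
```

For the points (1, 0), (−1, 0) and (0, 2) with prototype (0, 0), the first point has distance 0 because (−1, 0) is its mirror image, whereas (0, 2) has no mirror partner and obtains a strictly positive distance.
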

Table 9.1. A family of shell clustering algorithms

Algorithm                               Cluster shape
Fuzzy C-varieties [15]                  Line segments and lines
Fuzzy C-shells [32], [34]               Circles and ellipses
Hard C-spherical shells [82]            Circles and spheres
Unsupervised C-spherical shells [82]    Circles and spheres
Fuzzy C-spherical shells [82]           Circles and spheres
Possibilistic C-spherical shells [83]   Circles and spheres
Fuzzy C-rings [94]                      Circles
Fuzzy C-ellipses [55]                   Ellipses
Fuzzy C-means                           Spheres
Gustafson-Kessel                        Ellipsoids and possibly lines
Gath-Geva                               Ellipsoids and possibly lines
Fuzzy C-elliptotype [111]               Ellipsoids and possibly lines
Fuzzy C-quadric shells [84]             Linear and quadric shells
Norm-induced shell prototypes [16]      Rectangular shells
Fuzzy C-rectangular shells [70]         Rectangular shells, rectangles/polygons (approximation of circles, lines, ellipses)


To deal efficiently with pattern recognition and image segmentation in which 
we could encounter various geometrical shapes of the clusters, a number of shell 
clustering algorithms for detecting circles and hyperspherical shells have been 
proposed as extensions of C-means and FCM. 

The fuzzy C-varieties method [15] can be regarded as a simultaneous algorithm
for fuzzy clustering and PCA, in which the prototypes are multidimensional lin-
ear varieties represented by some local principal component vectors. The fuzzy
C-shells method is successful in clustering spherical shells and it has been further
generalized to adaptive fuzzy C-shells for the case of elliptical shells [32], [34].
The fuzzy C-spherical shells method [82] reduces the computational cost of fuzzy
C-shells by introducing an algebraic distance measure. For two-dimensional cases,
the fuzzy C-rings method [94] is used for clustering ring data, while the fuzzy
C-ellipses method [55] is for elliptical data. The fuzzy C-quadric shells method [84]
detects quadrics such as circles, ellipses, hyperbolas, or lines. The clustering
algorithms for detecting rectangular shells include norm-induced shell prototypes
[16] and fuzzy C-rectangular shells [70].

The above approaches are listed in Table 9.1. Like FCM, they suffer from 
three problems: lack of robustness against noisy points, sensitivity to prototype 
initialization, and requiring a priori knowledge of the optimal cluster number. 
Based on fuzzy C-spherical shells, information fuzzy C-spherical shells [116]
addresses outlier detection, prototype initialization and cluster validity for
robust fuzzy clustering of spherical shells in a unified framework of information
clustering.

These algorithms for shell clustering are based on iterative optimization of
objective functions similar to that for FCM, but define the distance from a
prototype $\lambda_i = (c_i, r_i)$ to the point $x_j$ as

$$d_{j,i} = d(x_j, \lambda_i) = \left( \left\| x_j - c_i \right\| - r_i \right)^2, \tag{9.16}$$

where $c_i$ is the center of the hypersphere and $r_i$ is the radius. They can effectively
estimate the optimal number of substructures in the data set by using some 
validity criteria such as spherical-shell thickness [82], fuzzy hypervolume and 
fuzzy density [54, 94]. By using different distance measures, many clustering 
algorithms can be derived for detecting clusters of various shapes such as lines 
and planes [78], and ellipsoids [78]. 
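
The shell distance (9.16) differs from the usual point-to-center distance in that it penalizes the gap between a point and the surface of the hypersphere. A one-function sketch, with a hypothetical function name, makes the difference concrete:

```python
import math


def shell_distance(x, center, radius):
    """Eq. (9.16): squared gap between point x and the shell (center, radius)."""
    return (math.dist(x, center) - radius) ** 2
```

For a shell of radius 5 centered at the origin, the point (3, 4) lies on the shell and has distance 0, while the center itself and the point (6, 8) both have distance 25. A prototype therefore attracts points lying on its surface rather than points near its center.
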


Partitional, hierarchical and density-based clustering 


Existing clustering algorithms are broadly classified into partitional, hierarchical 
and density-based clustering. Clustering methods discussed thus far belong to 
partitional clustering. 

Partitional clustering can be either hard clustering or fuzzy clustering. Fuzzy 
clustering can deal with overlapping cluster boundaries. Partitional clustering is 
dynamic, where points can move from one cluster to another. Knowledge of the 
shape or size of the clusters can be incorporated by using appropriate prototypes 
and distance measures. Due to the optimization of a certain criterion function, 
partitional clustering is sensitive to initialization and susceptible to local minima. 
It has difficulty in determining the suitable number of clusters K. In addition, it 
is also sensitive to noise and outliers. Typical partitional clustering algorithms 
have a computational complexity of O(N), for a training set of size N. 

Hierarchical clustering consists of a sequence of partitions in a hierarchical 
structure, which can be represented graphically as a clustering tree, called a 
dendrogram. It can be either an agglomerative or a divisive technique. Hierarchi-
cal clustering usually takes the form of agglomerative clustering. New clusters are
formed by reallocating the membership degree of one point at a time, based on 
some measure of similarity or distance. It is suitable for data with dendritic sub- 
structure. Divisive clustering performs in a way opposite to that of agglomerative 
clustering, but is computationally more costly. 

Hierarchical clustering has a number of advantages over partitional clustering. 
In hierarchical clustering, outliers can be easily identified, since they merge with 
other points less often due to their larger distances from the other points. Con- 
sequently, the number of points in a collection of outliers is typically much less 
than the number in a cluster. In addition, the number of clusters K does not need
to be specified, and the local minimum problem arising from initialization no
longer occurs. However, prior knowledge of the shape or size of the
clusters cannot be incorporated, and consequently overlapping clusters cannot
always be separated. Moreover, hierarchical clustering is static, and points committed
to a given cluster in the early stages cannot move to a different cluster.
It typically has a computational complexity of at least $O(N^2)$, which makes it
impractical for larger data sets.

Density-based clustering groups neighboring objects of a data set into clusters 
based on density conditions. Clusters are dense regions of objects in the data 
space and are separated by regions of low density. Density-based clustering is 
robust against outliers since an outlier affects clustering only in the neighborhood 
of this data point. It can handle outliers and discover clusters of arbitrary shape. 
The computational complexity of density-based clustering is of the same order 
of magnitude as that of hierarchical algorithms. 


Hierarchical clustering 


Distance measures, cluster representations and dendrograms 


The two simplest and well-known agglomerative clustering algorithms are the 
single linkage [115] and complete linkage [80] algorithms. The single linkage algo- 
rithm, also called the nearest-neighbor paradigm, is a bottom-up approach that 
generates clusters by sequentially merging pairs of similar clusters. The technique 
calculates the intercluster distance using the two closest data points in different 
clusters 

$$d(C_1, C_2) = \min_{x \in C_1,\, y \in C_2} d(x, y), \tag{9.17}$$

where $d(C_1, C_2)$ denotes the distance between clusters $C_1$ and $C_2$, and $d(x, y)$
the distance between data points x and y. The single linkage technique is more 
suitable for finding well-separated stringy clusters. 

In contrast, the complete linkage method uses the farthest distance between 
any two data points in different clusters to define the intercluster distance. 
Both single linkage and complete linkage algorithms have a time complexity of
$O(N^2 \log N)$, for N points. The clustering results for the two methods are illus-
trated in Fig. 9.3. Other more complicated methods are group average linkage,
median linkage and centroid linkage methods. 
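
The two intercluster distances can be written down directly from their definitions. The sketch below uses the Euclidean distance for $d(x, y)$, which is an assumption; any other point distance could be substituted.

```python
import math


def single_linkage(c1, c2):
    """Eq. (9.17): distance between the two closest points of different clusters."""
    return min(math.dist(x, y) for x in c1 for y in c2)


def complete_linkage(c1, c2):
    """Distance between the two farthest points of different clusters."""
    return max(math.dist(x, y) for x in c1 for y in c2)
```

For the clusters {(0, 0), (1, 0)} and {(3, 0), (6, 0)}, single linkage gives 2 and complete linkage gives 6. Single linkage needs only one close pair to merge two clusters, which is why it follows stringy, chain-like structures.
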

The representation of clusters is also necessary in hierarchical clustering. 
Agglomerative clustering can be based on the centroid [143], all-points [139] 
and scatter-points [59] representations. The shape and extent of a cluster are 
conventionally represented by its centroid or prototype. This is desirable only 
for spherically shaped clusters, but causes cluster splitting for a large or arbi- 
trarily shaped cluster, since the centroids of its subclusters can be far apart. At 
the other extreme, all data points in a cluster are used as its representatives, and 
this makes the clustering algorithm extremely sensitive to noisy data points and 
outliers. This all-points representation can cluster arbitrary shapes. The scatter-
points representation, as a tradeoff between the two extremes, represents each
cluster by a certain fixed number of points that are generated by selecting well-
scattered points from the cluster and then shrinking them toward the center of
the cluster by a specified fraction. This reduces the adverse effects of the outliers
since the outliers are typically farther away from the mean and are thus shifted
by a larger distance due to shrinking. The scatter-points representation achieves
robustness to outliers, and identifies clusters having nonspherical shape and wide
variations in size. For large data sets, storage or multiple input/output scans of
the data points is a bottleneck for existing clustering algorithms.

Figure 9.3 Agglomerative clustering results. (a) The single-link algorithm. (b) The complete-link
algorithm.

Figure 9.4 A single-linkage dendrogram.

Agglomerative clustering starts from N clusters, each containing exactly one 
data point. A series of nested merging is performed until finally all the data 
points are grouped into one cluster. Agglomerative clustering processes a set 
of $N^2$ numerical relationships between the N data points, and agglomerates
according to their similarity, usually measured by a distance. It is based on a local
connectivity criterion. The run time is $O(N^2)$. The process of agglomerative
clustering can be easily illustrated by using a dendrogram, as shown in Fig. 9.4.
The process of successive merging of the clusters is guided by the set distance
$\delta_{\min}$. At a cross-section with $\delta_{\min}$, the number of clusters can be decided. At the
cross-section shown in Fig. 9.4, there are three clusters: {a, b, c}, {d, e, f, g}, and
{h, i}.
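
The merging process and the cut at $\delta_{\min}$ can be sketched with a brute-force loop. The sketch assumes single linkage with the Euclidean distance, uses a hypothetical function name, and recomputes all pairwise cluster distances in every pass, so it is considerably slower than the run time quoted above and is meant only to show the logic.

```python
import math


def agglomerate(points, delta_min):
    """Merge clusters bottom-up until the closest pair is farther apart than delta_min."""
    clusters = [[p] for p in points]  # start from N singleton clusters
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance between clusters a and b.
                d = min(math.dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > delta_min:  # the next merge would cross the cut level
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

Stopping at a threshold corresponds to cutting the dendrogram with a horizontal line: every merge below the line is carried out, and the clusters that remain are those seen at the cross-section.
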

Figure 9.5 Illustration of an MST. By removing the two longest edges, three clusters are obtained.


9.6.2 Minimum spanning tree (MST) clustering


MST-based clustering [139] is a conventional agglomerative clustering technique. 
The MST method is a graph-theoretical technique [121]. It uses the all-points 
representation. MST clustering uses single linkage and first finds an MST for the 
input data. Then, by removing the longest K — 1 edges, K clusters are obtained, 
as shown in Fig. 9.5. Initially each point is a separate cluster. An agglomerative 
algorithm starts with the disjoint set of clusters. Pairs of clusters with minimum 
distance are then successively merged until a criterion is satisfied. MST clustering 
is good at clustering arbitrary shapes, and it has the ability to detect clusters 
with irregular boundaries. It has a complexity of $O(N^2)$.

In an MST graph, two points or vertices can be connected either by a direct 
edge, or by a sequence of edges called a path. The length of a path is the number 
of edges on it. The degree of link of a vertex is the number of edges that link 
to this vertex. A loop in a graph is a closed path. A connected graph has one 
or more paths between every pair of points. A tree is a connected graph with 
no closed loop. A spanning tree is a tree that contains every point in the data 
set. When a value is assigned to each edge in a tree, we get a weighted tree. The 
weight for each edge can be the distance between its two end points. The weight 
of a tree is the total sum of the edge weights in the tree. An MST is a spanning
tree that has the minimal total weight. Two properties used to identify edges
provably in an MST are the cut and cycle properties. The cut property states 
that the edge with the smallest weight crossing any two partitions of the vertex 
set belongs to the MST. The cycle property states that the edge with the largest 
weight in any cycle in a graph cannot be in the MST. For MST clustering, the 
weight associated with each edge denotes a distance between the two end points. 

An MST can be constructed using either Prim’s algorithm [104] or Kruskal’s 
algorithm [85]. Both algorithms grow the tree by adding one edge at a time. 
The cost of constructing an MST is O(m log n) for n vertices and m edges [104], 
[85]. The reverse-delete algorithm is the reverse of Kruskal’s algorithm; it starts
with the full graph and deletes edges in order of nonincreasing weights based on
the cycle property as long as doing so does not disconnect the graph. In MST
clustering, the inputs are a set of N data points and a distance measure defined
upon them, and the time complexity of Kruskal’s algorithm, Prim’s algorithm
and the reverse-delete algorithm is $O(N^2)$ [128]. The k-d tree and Delaunay trian-
gulation have been employed in the construction of an MST to reduce the time
complexity to near $O(N \log N)$, but they work well only for dimensions no more
than 5 [13].

Figure 9.6 The MST for the data. (a) The locations of the airports. (b) The generated MST.
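
MST clustering as described above, building the tree and then removing the K − 1 longest edges, can be sketched with Kruskal’s algorithm and a union-find structure. The function name and the use of the Euclidean distance are assumptions, and the sketch sorts all pairwise edges, so it costs $O(N^2 \log N)$ rather than the bounds cited above.

```python
import math


def mst_clusters(points, k):
    """Cluster points by building an MST (Kruskal) and dropping the k-1 longest edges."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # All pairwise edges, sorted by nondecreasing weight.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            mst.append((w, i, j))
    # Rebuild the components from the n-k shortest MST edges only,
    # which is the same as removing the k-1 longest MST edges.
    parent[:] = range(n)
    for w, i, j in mst[:n - k]:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

The returned list holds one component label per point; points that share a label belong to the same cluster. Because Kruskal’s algorithm adds edges in nondecreasing order of weight, the longest MST edges are simply the last ones appended.
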


Example 9.1: Consider a data set of 456 airports in the U.S. We have the mean
travel time between those airports, along with their latitude and longitude. The 
locations of the airports and the corresponding MST in terms of travel time are 
shown in Fig. 9.6. The MST is generated using Prim’s algorithm, and the result 
is based on the Gaimc package at MATLAB Central, provided by David Gleich. 


A fast MST-inspired clustering algorithm [128] tries to identify the relatively 
small number of inconsistent edges and remove them to form clusters before 
the complete MST is constructed. It can have a much better performance than 
O(N?) by using an efficient implementation of the cut and the cycle property of 
the MSTs. A more efficient method that can quickly identify the longest edges 
in an MST is also presented in [128]. 

MST-based clustering is very sensitive to the outliers, and it may merge two 
clusters due to the existence of a chain of outliers connecting them. With an 
MST being constructed, the next step is to define an edge inconsistency measure 
so as to partition the tree into clusters. In real-world tasks, outliers often exist, 
and this makes the longest edges an unreliable indication of cluster separations. 
In these cases, all the edges that satisfy the inconsistency measure are removed 
and the data points in the smallest clusters are regarded as outliers. 


9.6.3 BIRCH, CURE, CHAMELEON and DBSCAN


The BIRCH (balanced iterative reducing and clustering using hierarchies) 
method [143], [144] first performs an incremental and approximate preclustering 
phase in which dense regions of points are represented by compact summaries, 
and then a centroid-based hierarchical algorithm is used to cluster the set of
summaries. During preclustering, the entire database is scanned, and cluster 
summaries are stored in an in-memory data structure called a clustering feature 
tree. BIRCH uses cluster features to represent a subcluster. Given the cluster 
features of a subcluster, one can obtain the centroid, radius and diameter of that 
subcluster easily (in constant time). Furthermore, the cluster features vector of 
a new cluster formed by merging two subclusters can be directly derived from 
the cluster features vectors of the two subclusters by algebraic operations. On
several large data sets, BIRCH is significantly superior to CLARANS [98] and
C-means in terms of quality, speed, stability and scalability.
Robustness is achieved by eliminating outliers from the summaries via the
identification of sparsely distributed data points in feature space. The clustering 
feature tree grows by aggregation with only one pass over the data, thus hav- 
ing a complexity of O(N). Only one scan is needed to obtain good clustering 
results. One or more additional passes can be used to further improve the clus-
tering quality. BIRCH is not sensitive to the input order of the data. However,
BIRCH fails to identify clusters with nonspherical shapes or wide variation in
size, because it splits larger clusters and merges smaller clusters.
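
The additivity of cluster features can be illustrated with a short sketch. It assumes the usual cluster-feature triple from the BIRCH literature, (N, LS, SS), holding the number of points, their linear sum and the sum of their squared norms; the function names are hypothetical.

```python
import math


def cf_of(points):
    """Cluster feature (N, LS, SS): count, linear sum, and sum of squared norms."""
    n = len(points)
    ls = [sum(col) for col in zip(*points)]
    ss = sum(v * v for p in points for v in p)
    return n, ls, ss


def cf_merge(cf1, cf2):
    """CF of the union of two disjoint subclusters, obtained by componentwise addition."""
    return cf1[0] + cf2[0], [a + b for a, b in zip(cf1[1], cf2[1])], cf1[2] + cf2[2]


def cf_centroid(cf):
    """Centroid recovered from the cluster feature alone."""
    n, ls, _ = cf
    return [v / n for v in ls]


def cf_radius(cf):
    """Root-mean-square distance of the members from the centroid."""
    n, ls, ss = cf
    c2 = sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(ss / n - c2, 0.0))  # clamp tiny negative rounding errors
```

Merging two summaries never requires revisiting the original points, which is what allows the clustering feature tree to be maintained in a single pass over the data.
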

CURE (clustering using representatives) [59] is a robust clustering algorithm
based on the scatter-points representation. It is an improvement of the single-
linkage algorithm. CURE selects several scattered data points carefully as the
representatives for each cluster and shrinks these representatives toward the
centroid in order to eliminate the effects of outliers and avoid the chaining effect.
The distance between two clusters in CURE is defined as the minimal distance
between a representative of one cluster and a representative of the other. In each
iteration, it merges the two closest clusters. CURE clusters data of any shape. To
handle large databases, CURE employs a combination of random sampling and
partitioning to reduce the computational complexity. Random samples drawn from
the data set are first partitioned and each partition is partially clustered. The
partial clusters are then clustered in a second pass to yield the desired clusters.
CURE uses a k-d tree to search for the nearest representatives, together with a heap
data structure. However, the k-d tree searching structure does not work well in a
high-dimensional data set [13]. CURE has a computational complexity of $O(N^2)$ for
low-dimensional data, which is no worse than that of the centroid-based hierarchical
algorithm. It provides better performance with less execution time compared to
BIRCH. CURE can discover clusters with interesting shapes and is less sensitive than
MST to the outliers.
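
The scatter-points representation can be sketched as a selection step followed by a shrinking step. The farthest-first selection used here is one common way of obtaining well-scattered points and is an assumption, as are the function name and the parameter names `n_rep` and `alpha`; the sketch is not the full CURE algorithm of [59].

```python
import math


def cure_representatives(cluster, n_rep, alpha):
    """Pick n_rep well-scattered points and shrink them toward the centroid by alpha."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    # Farthest-first selection: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from all points chosen so far.
    chosen = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(chosen) < min(n_rep, len(cluster)):
        chosen.append(max(cluster, key=lambda p: min(math.dist(p, q) for q in chosen)))
    # Shrinking moves each representative a fraction alpha of the way to the centroid.
    return [[v + alpha * (c - v) for v, c in zip(p, centroid)] for p in chosen]
```

For the four corners of a square of side 4 plus its center, with four representatives and `alpha = 0.5`, the representatives are the corners pulled halfway toward the center. A far-away outlier chosen as a representative would be moved by the largest absolute distance, which dampens its effect on subsequent merges.
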

CHAMELEON [76] is a hybrid clustering algorithm. It first creates a graph,
where each node represents a pattern and edges between nodes are defined
according to the k-NN paradigm. A graph-partitioning
algorithm is used to recursively partition the graph into many small uncon-
nected subgraphs, each partitioning yielding two subgraphs of roughly equal size.
Agglomerative clustering is then applied, with each subgraph used as an initial sub-
cluster. CHAMELEON merges two subclusters only when the interconnectivity
(the number of links between two clusters) as well as the closeness (the length of
those links) of the individual clusters is very similar. It automatically adapts to
the characteristics of the clusters being merged. CHAMELEON is more effective
than CURE in discovering clusters of arbitrary shapes and varying densities. It
has a computational complexity of $O(N^2)$ [76].

Figure 9.7 Clustering using DBSCAN. (a) The data set. (b) The clustering result.

DBSCAN (density-based spatial clustering of applications with noise) [43] is 
a well-known density-based clustering algorithm. DBSCAN is designed to dis- 
cover clusters of arbitrary shape as well as to distinguish noise. In DBSCAN, 
a region is defined as the set of points that lie in the $\epsilon$-neighborhood of some
point p. Cluster label propagation from p to the other points in a region R
happens if |R|, the cardinality of R, exceeds a given threshold for the mini-
mal number of points. Generalized DBSCAN [112] can cluster point objects as
well as spatially extended objects according to both their spatial and nonspatial
attributes. DBSCAN uses two user-specified parameters that define a single density,
and thus it does not perform well on data sets with varying densities. DDBSCAN (double-
density-based SCAN) [24] needs two different densities as input parameters.
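
The label-propagation idea can be sketched with a compact brute-force version of DBSCAN. Several choices are assumptions: a region counts the point itself, a point is treated as a core point when its region holds at least `min_pts` points, neighbors are found by scanning all points, and noise is marked with the label −1.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def region(i):
        # Indices of all points in the eps-neighborhood of point i (including i).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region(j)
            if len(more) >= min_pts:  # j is a core point: propagate further
                queue.extend(more)
    return labels
```

Two tight groups of three points and one isolated point, processed with `eps = 1.5` and `min_pts = 3`, yield the labels 0 and 1 for the groups and −1 for the isolated point. The number of clusters is not specified in advance; it follows from the density parameters.
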


Example 9.2: We apply DBSCAN to a data set of three clusters. The threshold
for the minimal number of points is 3, and the neighborhood radius
$\epsilon$ is obtained by an analysis method. The clustering result is shown in Fig. 9.7.
Points labeled −1 are identified as outliers.


Affinity propagation [47] is a distance-based algorithm for identifying exem- 
plars in a data set. The method executes a message-passing process as well as 
an iterative process to find out the final exemplars in a data set. It selects some 
existing points in the data set as exemplars. It does not require any param-
eter to be specified in advance. However, affinity propagation usually breaks the
shapes of the clusters and partitions them into patches. APSCAN (affinity propagation spatial
clustering of applications with noise) [24] is a parameter-free clustering algorithm
that combines affinity propagation and DDBSCAN. APSCAN does not need to
predefine the two parameters as required in DBSCAN; it can not only cluster
data sets with varying densities but also preserve the nonlinear data structure
for such data sets. APSCAN can take the shape of clusters into account and
preserve the irregular structure of clusters.

DBSCAN and BIRCH give promising results for low-dimensional data. Online 
divisive-agglomerative clustering [107] is an incremental system for clustering 
streaming time series. The system is designed to process thousands of data 
streams that flow at a high rate. The main features of the system include update
time and memory consumption that do not depend on the number of examples
in the stream. Moreover, the time and memory required to process an example
decrease whenever the cluster structure expands.

A fast agglomerative clustering method proposed in [46] uses an approximate 
nearest neighbor graph for reducing the number of distance calculations. The 
computational complexity is improved from $O(\tau N^2)$ to $O(\tau N \log N)$ at the cost
of a slight increase in distortion, where $\tau$ denotes the number of nearest-neighbor
updates required at each iteration. 

Partitioning around medoids (PAM) [77] is based on finding k representative 
objects (medoids) that minimize the sum of the within-cluster dissimilarities. 
PAM works satisfactorily only for small data sets. Clustering large applications 
(CLARA) [77] is a modified PAM that can handle very large datasets. CLARA 
draws a sample of the data set, applies PAM on the sample, and finds the medoids 
of the sample; it draws multiple samples and gives the best clustering as the 
output. CLARANS (clustering large applications based on randomized search) 
[98] is a variant of CLARA that makes the search for the k-medoids more efficient. 
The runtime of a single call of CLARANS is close to quadratic. For small data 
sets, CLARANS [99] is a few times faster than PAM; the performance gap for 
larger data sets is even larger. Compared with CLARA, CLARANS has a search 
space that is not localized to a specific subgraph chosen a priori, and can produce 
clustering results with much better quality. CLARANS can handle not only point 
objects, but also polygon objects efficiently. 


Example 9.3: We apply vector quantization to image compression, using the
Lena image of 512 x 512 pixels as an example. The training vectors are obtained by
dividing the image into 4 x 4 blocks (each arranged as a vector of 16 dimensions), and
the desired codebook size is set to 256. By applying LBG, the peak signal-to-noise
ratio (PSNR) is 30.4641 dB. Each vector is coded by 8 bits; assuming 8 bits per
pixel, a block of 16 pixels occupies 128 bits, so the compression ratio is 1/16.
When applying entropy coding, entropy is defined as the average
number of bits needed for encoding a training vector, and the entropy for Lena is
7.2904. Applying the same procedure to the baboon image of 512 x 512 pixels,
we obtain an entropy of 7.5258. Figure 9.8 shows the results for the original and
compressed images.

Figure 9.8 Vector quantization of an image. (a) The original image. (b) The quantized image.
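The bit-rate arithmetic of Example 9.3 can be checked with a few lines of Python. This is only a sketch under stated assumptions: pixels are taken to be 8-bit grayscale (peak value 255), PSNR is taken in its usual form 10 log10(peak^2 / MSE), and the helper names `vq_bits_per_pixel` and `psnr` are hypothetical. It does not reproduce the LBG training itself or the PSNR value reported above.

```python
import math


def vq_bits_per_pixel(block_pixels, bits_per_vector):
    """Bits spent per pixel when each block is replaced by one codeword index."""
    return bits_per_vector / block_pixels


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equal-length pixel sequences (8-bit peak assumed)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


codebook_size = 256
index_bits = math.log2(codebook_size)   # 8 bits address 256 codewords
block_pixels = 4 * 4                    # one 4 x 4 block = 16 pixels
raw_bits = block_pixels * 8             # 128 bits per block at 8 bits/pixel (assumed)
ratio = index_bits / raw_bits           # fixed-length coding: 8 / 128
entropy_ratio = 7.2904 / raw_bits       # entropy coding, Lena figure from the example
```

With fixed-length indices the ratio is exactly 1/16; entropy coding lowers the average cost per block slightly below 8 bits, which is why `entropy_ratio` is a little smaller.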


Hybrid hierarchical/partitional clustering 


Most partitional clustering algorithms run in linear time. However, the clustering
quality of a partitional algorithm is generally not as good as that of hierarchical
algorithms. Some methods exploit the advantages of both the hierarchical and
partitional clustering techniques [125, 126, 56, 49].

The VQ-clustering and VQ-agglomeration methods [126] involve a vector
quantization process followed, respectively, by a clustering algorithm and an 
agglomerative clustering algorithm that treat codewords as initial prototypes. 
The agglomeration algorithm requires that each codeword be moved directly to 
the centroid of its neighboring codewords. A similar two-stage clustering proce- 
dure that uses SOM for vector quantization and an agglomerative clustering or 
C-means for further clustering is given in [125]. These two-stage methods have 
performance comparable to those of direct methods, with significantly reduced 
computational complexity [126, 125]. 

A two-stage clustering algorithm given in [120] clusters data with arbitrary 
shapes without knowing the number of clusters in advance. An ART-like algo- 
rithm is first used to partition data into a set of small multidimensional hyperel- 
lipsoids, and a dendrogram is then built to sequentially merge those hyperellip- 
soids. Hierarchical unsupervised fuzzy clustering [56] has the advantages of both 
hierarchical clustering and fuzzy clustering. Robust competitive agglomeration 
[49] employs competitive agglomeration to find the optimum number of clus- 
ters, uses prototypes to represent the global shape of the clusters, and integrates 
robust statistics to achieve noise immunity. 

Cohesion-based self-merging [88] runs in time linear in the size of the input data
set. Cohesion is a similarity measure for the intercluster distances. The
method partitions the input data set into several small subclusters, and then
continuously merges the subclusters based on cohesion in a hierarchical manner.
The method is robust and tolerates outliers well in various workloads. It is able
to cluster data sets of arbitrary shapes efficiently.


Constructive clustering techniques 


Conventional partitional clustering algorithms assume a network with a fixed 
number of clusters (nodes) K, which needs to be prespecified. However, selecting 
an appropriate value of K is a difficult task without prior knowledge of the input 
data. This difficulty can be resolved by using constructive clustering. A simple 
strategy for determining the optimal K is to perform clustering for a range of 
K, and select the value of K that minimizes a cluster-validity measure. 

The leader algorithm [63] is among the fastest clustering algorithms. It requires
only one pass through the data to put each input pattern in a particular cluster
or group of patterns. Associated with each cluster is a leader, which is a pattern
against which new patterns are compared to determine whether they belong to
this particular cluster. The leader algorithm starts with zero prototypes and adds
the current input pattern as a prototype, called a leader, whenever none of the
existing prototypes is close enough to it. The cosine of the angle between an
input vector and each prototype is used as a similarity measure. The clusters
that are created first tend to be very large. To determine the cluster that a new
pattern will be mapped to, the modified leader algorithm searches for the cluster
leader closest to the pattern. If that closest cluster leader is close enough to the
new pattern, then the new pattern belongs to that cluster; otherwise, the new
pattern becomes the leader of a new cluster. In this way, each cluster has an
equal chance of having the new pattern fit into it, rather than clusters that are
created earlier having an undue advantage. The choice of the closeness threshold
is critical. The training time is of the same order as, but lower than, that of
C-means. The drawbacks of the modified leader algorithm are the same as those
of C-means; additionally, the first pattern presented will always be a cluster
leader.
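The one-pass procedure of the modified leader algorithm can be sketched as follows. This is an illustrative reading of the description above, not the implementation of [63]: the function names are hypothetical, patterns are assumed to be nonzero vectors, and "close enough" is taken to mean that the cosine similarity reaches a user-chosen threshold.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def modified_leader(patterns, threshold):
    """Single pass: assign each pattern to its most similar leader,
    or make it a new leader if no leader is similar enough."""
    leaders = []   # one leader pattern per cluster
    labels = []    # cluster index assigned to each pattern
    for x in patterns:
        best, best_sim = None, -math.inf
        for k, leader in enumerate(leaders):
            sim = cosine(x, leader)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is not None and best_sim >= threshold:
            labels.append(best)
        else:
            leaders.append(x)              # the pattern starts a new cluster
            labels.append(len(leaders) - 1)
    return leaders, labels


leaders, labels = modified_leader(
    [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)], threshold=0.9)
```

Because the first pattern always becomes a leader and leaders never move, the result depends on the presentation order and on the threshold, which is the sensitivity noted in the text.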

ISODATA [8] is a popular early statistical method for data clustering. It can 
be treated as a variant of incremental C-means by incorporating some heuristics 
for merging and splitting clusters, and for handling outliers. Thus, ISODATA 
has a variable number of clusters K. 

The self-creating mechanism in the competitive learning process can adap- 
tively determine the natural number of clusters. Each node is associated with 
a local statistical variable, which is used to control the growing and pruning of 
the network architecture. The self-creating and organizing neural network [28] 
employs adaptively modified node thresholds to control its self-growth. At the 
presentation of a new input, if the winning node is active, the winning node is 
updated; otherwise, a new node is recruited from the winning node. The method 
avoids the underutilization problem, and has vector quantization accuracy and 
speed advantage over SOM and batch C-means. Self-splitting competitive learn- 
ing [145] can find the natural number of clusters based on the one-prototype- 
take-one-cluster paradigm and a validity measure for self-splitting. The paradigm 
enables each prototype to be situated at the centroid of one natural cluster when 
the number of clusters is greater than that of the prototypes. Self-splitting com- 
petitive learning starts with a single prototype and splits adaptively during the 
learning process until all the clusters are found. 

The growing cell structures (GCS) network [50] can be regarded as a modification
of SOM by integrating node-recruiting and pruning functions. GCS
assigns each node a local accumulated statistical variable called a signal
counter $u_i$. At each pattern presentation, only the winning node increases its
signal counter $u_w$ by 1, and then all the signal counters $u_i$ decay with a
forgetting factor. After a fixed number of learning iterations, the node with the largest
signal counter gets the right to insert a new node between itself and its farthest
neighbor. The network occasionally prunes a node whose signal counter is less
than a specified threshold during a complete epoch. The growing grid network
[52] is strongly related to GCS, but has a strictly rectangular topology. By inserting
complete rows or columns of units, the grid may adapt its height/width ratio
to a given pattern distribution. The semi-supervised learning method for growing
SOMs [72] allows fast visualisation of data class structure on the two-dimensional
feature map, based on Fritzke’s supervised learning architecture used on GCS.

The GNG model [51, 53] is based on GCS and neural gas. It is an SOM without
a fixed global network dimensionality, i.e., GNG is able to adapt to the local
dimensionality as well as to the local density of the input data. In addition, the
topology of the neurons reflects the topology of the input distribution. GNG is
capable of generating and removing neurons and lateral connections dynamically.
In GNG, lateral connections are generated according to the competitive
Hebbian learning rule. GNG achieves robustness against noise and performs
perfect topology-preserving mapping. By integrating an online criterion so as to
identify and delete useless neurons, GNG with utility criterion [53] is also able
to track nonstationary data input. GNG-T [48], extended from GNG, performs
vector quantization continuously over a distribution that changes over time. It
deals with both sudden changes and continuous ones, and is thus suited to the
video tracking framework.

The dynamic cell structures (DCS) model [20] uses a modified Kohonen learn- 
ing rule to adjust the prototypes and the competitive Hebbian rule to establish 
a dynamic lateral connection structure. Applying the principle of DCS to GCS 
yields the DCS-GCS algorithm, which has a behavior similar to that of GNG. 
The life-long learning cell structures (LLCS) algorithm [61] is an online cluster- 
ing and topology-representation method. It employs a strategy similar to that 
of ART. A similarity-based unit pruning procedure and an aging-based edge
pruning procedure are incorporated.

In adaptive incremental LBG [114], new codewords are inserted in regions of 
the input vector space where the distortion error is highest until the desired 
number of codewords (or a distortion error threshold) is achieved. The adaptive 
distance function is adopted to improve the quantization process. During the 
incremental process, a removal-insertion technique is used to fine-tune the code- 
book to make the proposed method independent of the initial conditions. The 
method works better than enhanced LBG. It can also be used to minimize the
number of codewords, and find a suitable codebook, for a fixed distortion error.

Growing hierarchical tree SOM [45] has an SOM-like self-organizing process 
that allows the network to adapt the topology of each layer of the hierarchy 
to the characteristics of the training set. The network grows as a tree: it starts
with a triangle SOM, and every neuron grows by adding one new triangle SOM.
Moreover, the training process considers the possibility of deleting nodes.

By employing a lattice with a hyperbolic grid topology, hyperbolic SOM com- 
bines the virtues of the SOM and hyperbolic spaces for adaptive data visual- 
ization. However, due to the exponential growth of its hyperbolic lattice, it also 
exacerbates the need for addressing the scaling problem of SOMs comprising 
very large numbers of nodes. Hierarchically growing hyperbolic SOM [101] com- 
bines the virtues of hierarchical data organization, adaptive growing to a required 
granularity, good scaling behaviour and smooth, map-based browsing.


Cluster validity


The number of clusters is application-specific, and is usually specified by the user.
A number of clusters, or a clustering algorithm, can be called optimal only with
respect to a certain cluster-validity criterion. Many cluster-validity measures
have been defined for this purpose.


Measures based on compactness and separation of clusters 


A good clustering algorithm should generate clusters with small intracluster 
deviations and large intercluster separations. Cluster compactness and cluster 
separation are two measures for describing the performance of clustering. Given 
the clustered result of a data set X, the cluster compactness, cluster separation, 
and overall cluster quality measures are, respectively, defined by [68] 


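\[
E_{\mathrm{CMP}} = \frac{1}{K} \sum_{i=1}^{K} \frac{\sigma(c_i)}{\sigma(X)}, \tag{9.18}
\]
\[
E_{\mathrm{SEP}} = \frac{1}{K(K-1)} \sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} \exp\left( -\frac{d^2(c_i, c_j)}{2\sigma_0^2} \right), \tag{9.19}
\]
\[
E_{\mathrm{OCQ}}(\gamma) = \gamma E_{\mathrm{CMP}} + (1 - \gamma) E_{\mathrm{SEP}}, \tag{9.20}
\]


where K is the number of clusters, $c_i$ and $c_j$ are, respectively, the centers of
clusters i and j, $\sigma(c_i)$ denotes the standard deviation of cluster i, $\sigma(X)$ is the
standard deviation of data set X, $\sigma_0$ is the deviation of a Gaussian distribution,
$d(c_i, c_j)$ is the distance between $c_i$ and $c_j$, and $\gamma \in [0, 1]$. Small $E_{\mathrm{CMP}}$ means
that all the clusters have small deviations, and smaller $E_{\mathrm{SEP}}$ corresponds to
better separation performance.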
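The three measures can be computed directly from a clustered data set. The sketch below assumes that compactness is the average ratio of cluster deviation to data-set deviation, that separation is the average of exp(-d^2 / (2 sigma_0^2)) over all ordered pairs of distinct centers, and that the overall quality is their convex combination with weight gamma. It further assumes that the "deviation" of a set of vectors is the root mean squared Euclidean distance to the set mean; the function names are hypothetical, and [68] should be consulted for the exact conventions.

```python
import math


def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deviation(points):
    """Root mean squared distance to the mean (assumed meaning of sigma)."""
    c = mean(points)
    return math.sqrt(sum(dist(p, c) ** 2 for p in points) / len(points))


def compactness(clusters):
    """Average ratio of cluster deviation to data-set deviation."""
    all_points = [p for cluster in clusters for p in cluster]
    sigma_x = deviation(all_points)
    return sum(deviation(c) for c in clusters) / (len(clusters) * sigma_x)


def separation(clusters, sigma0):
    """Average Gaussian-weighted closeness over ordered pairs of centers."""
    centers = [mean(c) for c in clusters]
    k = len(centers)
    total = sum(math.exp(-dist(centers[i], centers[j]) ** 2 / (2 * sigma0 ** 2))
                for i in range(k) for j in range(k) if i != j)
    return total / (k * (k - 1))


def overall_quality(clusters, gamma, sigma0):
    """Convex combination of compactness and separation."""
    return gamma * compactness(clusters) + (1 - gamma) * separation(clusters, sigma0)


tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # natural grouping
mixed = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]   # poor grouping of the same points
```

On this toy data the natural grouping scores lower (better) on all three measures than the poor grouping, as the definitions suggest it should.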

Another popular cluster-validity criterion is defined as a function of the ratio 
of the sum of the within-cluster scatters to the between-cluster separation [37] 


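\[
E_{\mathrm{WBR}} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \left\{ \frac{d_{\mathrm{WCS}}(c_k) + d_{\mathrm{WCS}}(c_l)}{d_{\mathrm{BCS}}(c_k, c_l)} \right\}, \tag{9.21}
\]

where the within-cluster scatter for cluster k, denoted as $d_{\mathrm{WCS}}(c_k)$, and the
between-cluster separation for cluster k and cluster l, denoted as $d_{\mathrm{BCS}}(c_k, c_l)$,
are, respectively, calculated by

\[
d_{\mathrm{WCS}}(c_k) = \frac{\sum_{i} \| x_i - c_k \|}{N_k}, \tag{9.22}
\]
\[
d_{\mathrm{BCS}}(c_k, c_l) = \| c_k - c_l \|, \tag{9.23}
\]

the sum in (9.22) running over the patterns $x_i$ in cluster k, and $N_k$ being the
number of data points in cluster k. The best clustering minimizes
$E_{\mathrm{WBR}}$. This measure indicates good clustering results for spherical clusters [125].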
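The within-to-between ratio criterion lends itself to a short implementation. The sketch below assumes the form described above: for each cluster, take the worst (largest) ratio of summed within-cluster scatters to center distance over all other clusters, then average over clusters. Cluster centers are taken to be cluster means and the norm to be Euclidean; both are assumptions, and the function names are hypothetical.

```python
import math


def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_scatter(cluster, center):
    """Average distance of the cluster's patterns to its center."""
    return sum(dist(x, center) for x in cluster) / len(cluster)


def wbr_index(clusters):
    """Average over clusters of the worst within/between ratio (needs K >= 2
    and distinct centers)."""
    centers = [mean(c) for c in clusters]
    scatters = [within_scatter(c, ctr) for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for a in range(k):
        total += max((scatters[a] + scatters[b]) / dist(centers[a], centers[b])
                     for b in range(k) if b != a)
    return total / k


tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
mixed = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
```

For the `tight` grouping each scatter is 0.5 and the centers are 10 apart, giving an index of 0.1; the `mixed` grouping has scatters of 5 and centers only 1 apart, so its index is far larger.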

A cluster-validity criterion has been defined in [131] for evaluating fuzzy
clustering; it minimizes the ratio of compactness $E_{\mathrm{CMP1}}$ and separation $E_{\mathrm{SEP1}}$,
$E_{\mathrm{XB}} = E_{\mathrm{CMP1}} / E_{\mathrm{SEP1}}$, defined by

\[
E_{\mathrm{CMP1}} = \frac{1}{N} \sum_{i=1}^{K} \sum_{p=1}^{N} \mu_{ip}^2 \| x_p - c_i \|_A^2, \tag{9.24}
\]
\[
E_{\mathrm{SEP1}} = \min_{i \neq j} \| c_i - c_j \|_A^2, \tag{9.25}
\]

where $\| \cdot \|_A$ denotes a weighted norm, and A is a positive-definite symmetric
matrix. Notice that $E_{\mathrm{CMP1}}$ is equal to the criterion function for FCM given by
(8.40) when m = 2 and A is the identity matrix.
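A minimal sketch of this compactness-to-separation ratio follows. It assumes A is the identity matrix (plain Euclidean norm), that compactness is the membership-squared weighted sum of squared distances divided by the number of patterns N, and that separation is the smallest squared distance between two distinct centers. The layout of `memberships` (one row per cluster, one column per pattern) and the function names are choices made here for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(points, centers, memberships):
    """Ratio of fuzzy compactness to minimum squared center separation.

    memberships[i][p] is the membership of pattern p in cluster i.
    Requires at least two distinct centers.
    """
    n = len(points)
    k = len(centers)
    compact = sum(memberships[i][p] ** 2 * sq_dist(points[p], centers[i])
                  for i in range(k) for p in range(n)) / n
    separate = min(sq_dist(centers[i], centers[j])
                   for i in range(k) for j in range(k) if i != j)
    return compact / separate


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers = [(0.5,), (10.5,)]
crisp_u = [[1, 1, 0, 0], [0, 0, 1, 1]]   # crisp memberships as a special case
score = xie_beni(points, centers, crisp_u)
```

Here each pattern lies 0.5 from its center, so the compactness is 0.25, the squared separation is 100, and the ratio is 0.0025; smaller values indicate a better partition.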

In [140], a cluster-validity criterion is defined based on ratio and summation 
of compactness and overlap measures. Both measures are calculated from mem- 
bership degrees. The maximal value of the criterion denotes the optimal fuzzy 
partition that is expected to have high compactness and a low degree of overlap 
among clusters. The criterion is reliable and effective, especially when evaluating 
partitions with clusters that widely differ in size or density. 


Measures based on hypervolume and density of clusters 


A good partitioning of the data usually leads to a small total hypervolume and a 
large average density of the clusters. Cluster-validity criteria can thus be selected 
as the hypervolume and average density of the clusters. The fuzzy hypervolume 
criterion is defined by [54, 82] 


\[
E_{\mathrm{FHV}} = \sum_{i=1}^{K} V_i, \tag{9.26}
\]

where $V_i$ is the volume of the ith cluster

\[
V_i = [\det(F_i)]^{1/2}, \tag{9.27}
\]

and $F_i$, the fuzzy covariance matrix of the ith cluster, is defined by [60]

\[
F_i = \frac{1}{\sum_{j=1}^{N} \mu_{ij}^m} \sum_{j=1}^{N} \mu_{ij}^m (x_j - c_i)(x_j - c_i)^T. \tag{9.28}
\]

The average fuzzy density criterion is defined by

\[
E_{\mathrm{AFD}} = \frac{1}{K} \sum_{i=1}^{K} \frac{S_i}{V_i}, \tag{9.29}
\]

where $S_i$ sums the membership degrees of only the members within the
hyperellipsoid

\[
S_i = \sum_{j=1}^{N} \mu_{ij}, \quad \forall x_j \in \left\{ x_j \mid (x_j - c_i)^T F_i^{-1} (x_j - c_i) < 1 \right\}. \tag{9.30}
\]


The fuzzy hypervolume criterion typically has a clear extremum; the average 
fuzzy density criterion is not desirable when there is substantial cluster overlap- 
ping and large variability in the compactness of the clusters [54]. The average 
fuzzy density criterion averages the fuzzy densities of individual clusters, and a 
partitioning that results in both dense and loose clusters may lead to a large 
average fuzzy density. 


Measures for shell clustering 

For shell clustering, the hypervolume and average density measures are still
applicable. However, the distance vector between a pattern and a prototype needs to
be redefined. In the case of spherical shell clustering, the distance vector between
pattern $x_j$ and prototype $\lambda_i = (c_i, r_i)$ is defined by

\[
d_{ji} = (x_j - c_i) - r_i \frac{x_j - c_i}{\| x_j - c_i \|}, \tag{9.31}
\]

where $r_i$ is the radius of the shell. The fuzzy hypervolume and average fuzzy
density measures for spherical-shell clustering are obtained by replacing $x_j - c_i$
in (9.28) and (9.30) by $d_{ji}$.
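The redefined distance vector is the offset from a pattern to the nearest point of a spherical shell: the vector from the center to the pattern, minus a vector of length r in the same direction. The sketch below assumes this reading and a Euclidean norm; the function name is hypothetical. A useful sanity check is that the length of the result equals the absolute difference between the pattern's distance from the center and the shell radius.

```python
import math


def shell_distance_vector(x, center, radius):
    """Vector from the closest point on a spherical shell to the pattern x."""
    diff = [a - b for a, b in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        # The direction is undefined when the pattern sits on the center.
        raise ValueError("pattern coincides with the shell center")
    return [d - radius * d / norm for d in diff]


d_outside = shell_distance_vector((3.0, 4.0), (0.0, 0.0), 2.0)   # pattern at distance 5
d_on_shell = shell_distance_vector((0.0, 2.0), (0.0, 0.0), 2.0)  # pattern on the shell
```

A pattern lying exactly on the shell yields the zero vector, so it contributes nothing to the shell-based hypervolume or density measures.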

For shell clustering, the shell thickness measure can be used to describe the 
compactness of a shell. In the case of fuzzy spherical shell clustering, the fuzzy 
shell thickness of a cluster can be defined by [82] 


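\[
T_i = \frac{\sum_{j=1}^{N} (\mu_{ij})^m \left( \| x_j - c_i \| - r_i \right)^2}{r_i \sum_{j=1}^{N} (\mu_{ij})^m}. \tag{9.32}
\]

The average shell thickness of all clusters can be used as a cluster-validity
criterion for shell clustering

\[
E_{\mathrm{THK}} = \frac{1}{K} \sum_{i=1}^{K} T_i. \tag{9.33}
\]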


Crisp silhouette and fuzzy silhouette 


The average silhouette width criterion, or crisp silhouette [77], is defined as 


\[
E_{\mathrm{CS}} = \frac{1}{N} \sum_{j=1}^{N} s_j, \tag{9.34}
\]

where $s_j$ is the silhouette of pattern j according to

\[
s_j = \frac{b_{pj} - a_{pj}}{\max\{a_{pj}, b_{pj}\}}, \tag{9.35}
\]

$a_{pj}$ is the average distance of pattern j to all other patterns belonging to cluster
p, $d_{qj}$ is the average distance of pattern j to all patterns in cluster q, $q \neq p$, and
$b_{pj} = \min_{q=1,\ldots,K,\, q \neq p} d_{qj}$, which represents the dissimilarity of pattern j to its
closest neighboring cluster.

Fuzzy silhouette [22] improves crisp silhouette in detecting regions with higher 
data density when the data set involves overlapping clusters, besides being more 
appealing in the context of fuzzy cluster analysis. It is defined as 

\[
E_{\mathrm{FS}} = \frac{\sum_{j=1}^{N} (\mu_{mj} - \mu_{nj})^{\alpha} s_j}{\sum_{j=1}^{N} (\mu_{mj} - \mu_{nj})^{\alpha}}, \tag{9.36}
\]

where $\mu_{mj}$ and $\mu_{nj}$ are the first and second largest elements of the jth column
of the fuzzy partition matrix, respectively, and $\alpha > 0$ is a weighting coefficient.

Fuzzy silhouette is computationally much less intensive than the fuzzy
hypervolume and average partition density criteria, especially when the data set
involves many attributes. It is straightforwardly applicable as an objective
function for global optimization methods, designed for automatically finding the right
number of clusters in a data set.
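Both silhouette criteria can be sketched in a few functions. The code assumes Euclidean distances, at least two clusters, and no duplicate-only clusters; it assigns a silhouette of 0 to a pattern that is alone in its cluster, which is a common convention but an assumption here. The fuzzy version weights each pattern's silhouette by the gap between its two largest memberships raised to the power alpha. Function names and the layout of `memberships` (one row per cluster, one column per pattern) are illustrative choices.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouettes(points, labels):
    """Silhouette s_j of every pattern for a crisp labelling."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    result = []
    for j, p in enumerate(labels):
        own = [i for i in clusters[p] if i != j]
        if not own:
            result.append(0.0)   # singleton cluster: convention assumed here
            continue
        a = sum(dist(points[j], points[i]) for i in own) / len(own)
        b = min(sum(dist(points[j], points[i]) for i in members) / len(members)
                for q, members in clusters.items() if q != p)
        result.append((b - a) / max(a, b))
    return result


def crisp_silhouette(points, labels):
    s = silhouettes(points, labels)
    return sum(s) / len(s)


def fuzzy_silhouette(points, labels, memberships, alpha=1.0):
    """Silhouettes weighted by (largest - second largest membership) ** alpha."""
    s = silhouettes(points, labels)
    weights = []
    for j in range(len(points)):
        column = sorted((row[j] for row in memberships), reverse=True)
        weights.append((column[0] - column[1]) ** alpha)
    return sum(w * sj for w, sj in zip(weights, s)) / sum(weights)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
u = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
cs = crisp_silhouette(points, labels)
fs = fuzzy_silhouette(points, labels, u, alpha=1.0)
```

Setting `alpha=0` makes all weights equal to 1, so the weighted average reduces to the crisp silhouette; larger values of alpha give more influence to patterns with unambiguous memberships.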


Other measures 

Several robust-type validity measures are proposed in [130] by analyzing the
robustness of a validity measure using the $\phi$-function of the M-estimate.
Median-type validity measures [130] are robust to noise and outliers, and work better
than the mean-type validity measures.

A review of fuzzy cluster-validity measures is given in [127]. Moreover, exten- 
sive comparisons of many measures in conjunction with FCM are conducted on 
a number of widely used data sets. It is concluded that none of the measures 
correctly recognizes optimal cluster numbers K for all test data sets. 

Nearest-neighbor clustering [21] is a baseline algorithm to minimize arbitrary 
clustering objective functions. It is statistically consistent for all commonly used 
clustering objective functions. An empirical risk approximation approach for 
unsupervised learning is proposed along the line of empirical risk minimization 
for the supervised case. The clustering quality is an expectation with respect 
to the true underlying probability distribution, and the empirical quality is the 
corresponding empirical expectation. Then, generalization bounds can be derived 
using VC dimensions. 

Bregman divergences include a large number of useful distortion functions
such as squared loss, Kullback-Leibler divergence, logistic loss, Mahalanobis
distance, Itakura-Saito distance and I-divergence [9]. C-means, LBG for clustering
speech data and information-theoretic clustering for clustering probability
distributions [39] are special cases of Bregman hard clustering for squared Euclidean
distance, Itakura-Saito distance and Kullback-Leibler divergence, respectively.
This is achieved by first posing the hard clustering problem in terms of mini- 
mizing the loss in Bregman information, a quantity motivated by rate-distortion 
theory, and then deriving an iterative algorithm that monotonically decreases 
this loss. 


Projected clustering 


Sparsity is a phenomenon of high-dimensional data. In text data, documents
related to a particular topic are characterized by a subset of terms. Such a
situation also occurs in supplier categorization. Subspace clustering seeks to group
objects into clusters on subsets of dimensions or attributes of a data set. Clus- 
tering high-dimensional data has been a major challenge due to the inherent 
sparsity of the points. The similarity between different members of a cluster can 
only be recognized in the specific subspace. 

CLIQUE [5] identifies dense clusters in subspaces of maximum dimensionality. 
It works in a level-wise manner, exploring k-dimensional projected clusters after 
clusters of dimensionality k — 1 have been discovered. CLIQUE automatically 
finds subspaces with high-density clusters. It produces identical results
irrespective of the order of the input presentation. CLIQUE scales linearly, as
O(N), with the number of samples N, but exponentially with the cluster
dimensionality.

The partitional approach PROCLUS [1] is similar to iterative clustering tech- 
niques such as C-means or k-medoids [77]. It is a medoid-based projected cluster- 
ing algorithm that improves the scalability of CLIQUE by selecting a number of 
good candidate medoids and exploring the clusters around them. Some patterns 
are initially chosen as the medoids. But, before assigning every pattern in the 
data set to the nearest medoid, each medoid is first assigned a set of neighbor- 
ing patterns that are close to it in the input space to form a tentative cluster. 
The technique iteratively groups the patterns into clusters, and eliminates the 
least relevant dimensions from each of the clusters. Since PROCLUS optimizes 
a criterion similar to that of C-means, it can find only spherically shaped clus- 
ters. Both the number of clusters and the average number of dimensions per 
cluster are user-defined. ORCLUS [2] improves PROCLUS by adding a merging 
process of clusters, and selecting for each cluster principal components instead 
of attributes. ORCLUS can discover arbitrarily oriented clusters. It still relies 
on user-supplied values in deciding the number of dimensions to select for each 
cluster. 

Halite is a fast, deterministic subspace clustering method [31]. It analyzes
the point distribution in the full space by performing a multiresolution,
recursive partition of that space so as to find clusters covering regions with varying
sizes, shapes, density, correlated axes, and number of points. Halite uses MDL
to automatically tune a density threshold with regard to the data distribution.
It is robust to noise. Halite is linear or quasilinear in time and space in terms of
the data size and dimensionality.

HARP [137] does not depend on user inputs in determining the relevant dimen- 
sions of clusters. It utilizes the relevance index, histogram-based validation, 
and dynamic threshold loosening to adaptively adjust the merging requirements 
according to the clustering status. HARP has high accuracy and usability, even 
when handling noisy data. It exploits the clustering status to adjust the internal 
thresholds dynamically without the assistance of user parameters. 

Based on the analogy between mining frequent itemsets and discovering 
dense projected clusters around random points, DOC [138] performs iterative 
greedy projected clustering. Several techniques that employ the branch-and- 
bound paradigm are proposed to efficiently discover the projected clusters [138]. 
DOC can automatically discover the number of clusters K, and it can discover 
a set of clusters with large size variations. A density-based projective clustering
algorithm (DOC/FastDOC) [105] requires setting the maximum distance between
attribute values, and pursues an optimality criterion defined in terms of density 
of each cluster in its corresponding subspace. In practice it may be difficult to 
set the parameters of DOC, as each relevant attribute can have different local 
variance. 

Soft subspace clustering [73] clusters data objects in the entire data space,
but assigns different weighting values to different dimensions of clusters in the
clustering process. EWKM [75] extends C-means to calculate a weight for each
dimension in each cluster and uses the weight values to identify the subsets 
of important dimensions that categorize different clusters. This is achieved by 
including the weight entropy in the objective function that is minimized in C- 
means. An additional step is added to automatically compute the weights of all 
dimensions in each cluster. EWKM outperforms PROCLUS and HARP. 

High-dimensional projected stream (HPStream) clustering [4] incorporates a 
fading cluster structure and projection-based clustering methodology. It is incre- 
mentally updatable and is highly scalable on both the number of dimensions 
and the size of the data streams, and it achieves better clustering quality than
previous stream clustering methods do.

Projected Clustering based on the k-Means Algorithm (PCKA) [18] is a robust 
partitional distance based projected clustering algorithm. Interactive projected 
clustering (IPCLUS) [3] performs high-dimensional clustering by cooperation 
between the human and the computer. 


Spectral clustering


Spectral clustering arises from concepts in spectral graph theory. The basic idea
is to construct a weighted graph from the initial data set where each node
represents a pattern and each weighted edge accounts for the similarity between two
patterns; the clustering problem is configured as a graph-cut problem, where an
appropriate objective function has to be optimized. Spectral clustering is
considered superior to C-means in that it has a deterministic polynomial-time
solution and is able to model arbitrarily shaped clusters. It is an approach for
finding non-convex clusters.

The clustering problem can be tackled by means of spectral graph theory. The 
core of this theory is the eigenvalue decomposition of the Laplacian or similarity 
matrix L of the weighted graph obtained from data. The spectral clustering 
algorithm amounts to embedding the data into a feature space by using the 
eigenvectors of the similarity matrix in such a way that the clusters may be 
separated by hyperplanes. Spectral clustering finds the m eigenvectors, arranged
as the columns of an $N \times m$ matrix Z,
corresponding to the m smallest eigenvalues of L (ignoring the trivial constant
eigenvector). Using a standard method like C-means, we then cluster the rows 
of Z to yield a clustering of the original data points. 
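The simplest instance of this procedure is a two-way split with m = 1: compute the eigenvector of the unnormalized graph Laplacian L = D - W belonging to the smallest nonzero eigenvalue, and divide the nodes by the sign of its entries. The sketch below is a simplification under stated assumptions, not the general algorithm: it replaces the final C-means step with a sign threshold, assumes a connected graph with a symmetric nonnegative affinity matrix W, and obtains the eigenvector by power iteration on a shifted matrix rather than by a full eigendecomposition. The function name is hypothetical.

```python
import math


def spectral_bipartition(W, iters=500):
    """Split the nodes of a connected weighted graph into two groups."""
    n = len(W)
    degrees = [sum(row) for row in W]
    # Unnormalized graph Laplacian L = D - W.
    laplacian = [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
                 for i in range(n)]
    # All eigenvalues of L lie in [0, 2 * max degree] (Gershgorin bound), so the
    # dominant eigenvectors of shift*I - L are those of L with smallest eigenvalue.
    shift = 2.0 * max(degrees)
    v = [float(i) for i in range(n)]   # deterministic, non-constant start vector
    for _ in range(iters):
        mean_v = sum(v) / n
        v = [x - mean_v for x in v]    # remove the trivial constant eigenvector
        v = [shift * v[i] - sum(laplacian[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return [0 if x < 0 else 1 for x in v]


# Two triangles joined by one weak edge between nodes 2 and 3.
W = [
    [0, 1, 1, 0,    0, 0],
    [1, 0, 1, 0,    0, 0],
    [1, 1, 0, 0.01, 0, 0],
    [0, 0, 0.01, 0, 1, 1],
    [0, 0, 0,    1, 0, 1],
    [0, 0, 0,    1, 1, 0],
]
parts = spectral_bipartition(W)
```

Which group receives label 0 depends on the sign of the computed eigenvector, so only the grouping itself is meaningful. For more than two clusters one would keep m eigenvectors and cluster the rows of Z, as described in the text.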

A direct connection between kernel PCA and spectral methods has been shown
in [14]. A unifying view of kernel C-means and spectral clustering methods
has been pointed out in [41]. A general weighted kernel C-means objective is 
mathematically equivalent to a weighted graph clustering objective. Based on 
this equivalence, a fast multilevel algorithm is developed that directly optimizes 
various weighted graph clustering objectives [41]. This eliminates the need for 
eigenvector computation for graph clustering problems. The multilevel algorithm 
removes the restriction of equal-sized clusters by using kernel C-means to opti- 
mize weighted graph cuts. 

Incremental spectral clustering handles not only insertion/deletion of data 
points but also similarity changes between existing points. In [100], spectral
clustering is extended to evolving data, by introducing the incidence vector/matrix
to represent two kinds of dynamics in the same framework and by incrementally 
updating the eigen-system. 

A Markov random walks view of spectral clustering is given in [69]. This inter- 
pretation shows that many properties of spectral clustering methods can be 
expressed in terms of a stochastic transition matrix P obtained by normalizing 
the affinity matrix such that its rows sum to 1. 
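The row normalization that produces the transition matrix is a one-line operation per row. The sketch below assumes a nonnegative affinity matrix given as a list of lists and rejects isolated nodes, whose rows cannot be normalized; the function name is hypothetical.

```python
def transition_matrix(affinity):
    """Row-normalize an affinity matrix so that every row sums to 1."""
    result = []
    for row in affinity:
        degree = sum(row)
        if degree == 0:
            raise ValueError("isolated node: affinity row sums to zero")
        result.append([w / degree for w in row])
    return result


W = [[0, 2, 2],
     [2, 0, 0],
     [2, 0, 0]]
P = transition_matrix(W)
```

Entry `P[i][j]` is then the probability that a random walk at node i moves to node j in one step, which is the quantity the Markov interpretation works with.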


Coclustering 


Coclustering, or biclustering, simultaneously clusters patterns and their features. 
Among the advantages of coclustering are its good performance in high dimen- 
sion and its ability to provide more interpretable clusters than its clustering 
counterpart. Coclustering can perform well in high-dimensional space because 
its feature clustering process can be seen as a dynamic dimensionality reduction 
for the pattern space and vice versa. A bicluster is a subset of rows that exhibit 
similar behavior across a subset of columns, and vice versa.

Hartigan [62] pioneered this type of analysis using two-way analysis of variance
to locate constant-valued submatrices within data sets. The method refers to the 
simultaneous clustering of both rows and columns of a data matrix. 

Soft model dual fuzzy possibilistic coclustering [123] is inspired by possibilistic 
FCM. The model targets robustness to outliers and richer representations of 
coclusters. It preserves the desired properties of possibilistic FCM, and has the
same time complexity as possibilistic FCM and FCM with respect to the number of
(co-)clusters.

An information-theoretic coclustering algorithm [40] views a nonnegative 
matrix as the estimate of a (scaled) empirical joint probability distribution of two 
discrete random variables and poses the coclustering problem as an optimiza- 
tion problem in information theory, where the optimal coclustering maximizes 
the mutual information between the clustered random variables subject to con- 
straints on the number of row and column clusters. 


Handling qualitative data 


A majority of the real-world data is described by a combination of numeric and 
qualitative (nominal, ordinal) features such as categorical data, which is the case 
in survey data. There are a number of challenges in clustering categorical data. 
First, lack of an inherent order on the domains of the individual attributes pre- 
vents the definition of a notion of similarity, which catches resemblance between 
categorical data objects. A typical approach to processing categorical values is to 
resort to a preprocess such as binary encoding, which transforms each categorical 
attribute into a set of binary attributes in such a way that each distinct categor- 
ical value is associated with one of the binary attributes. Consequently, after the 
transformation, all categorical attributes become binary attributes, which can 
thus be treated as numeric attributes with the domain of {0, 1}. 
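The binary encoding step can be sketched as follows. Each categorical attribute is expanded into one 0/1 attribute per distinct value. The function name is hypothetical, and the alphabetical ordering of the generated columns is an arbitrary choice made here for reproducibility.

```python
def binary_encode(records):
    """Expand each categorical attribute into one binary attribute per value.

    Returns the sorted value list of every attribute and the encoded records.
    """
    n_attr = len(records[0])
    categories = [sorted({r[a] for r in records}) for a in range(n_attr)]
    encoded = [[1 if r[a] == value else 0
                for a in range(n_attr) for value in categories[a]]
               for r in records]
    return categories, encoded


records = [("red", "S"), ("blue", "M"), ("red", "M")]
categories, encoded = binary_encode(records)
```

Every encoded record contains exactly one 1 per original attribute, so the Euclidean distance between two encoded records grows with the number of attributes on which they disagree.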

Some algorithms for clustering categorical data are k-modes [74], fuzzy cen- 
troid, and fuzzy k-partitions [135]. k-modes is an extension of C-means to cate- 
gorical domains and domains with mixed numeric and categorical values [74]. It 
uses a simple matching dissimilarity measure to deal with categorical patterns, 
replaces the means of clusters with modes, and uses a frequency-based method 
to update the modes in the clustering process to minimize the cost function. 
The k-prototypes algorithm, through the definition of a combined dissimilarity 
measure, further integrates C-means and k-modes to allow for clustering pat- 
terns described by mixed numeric and categorical attributes. A fuzzy k-partitions 
model [135] is based on the likelihood function of multivariate multinomial dis- 
tributions. FCM has also been extended for clustering symbolic data [42]. All 
these algorithms have linear complexity. 
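The building blocks of k-modes named above (simple matching dissimilarity, modes in place of means, and a frequency-based mode update) can be sketched as below. This shows one assignment step and one update step only, not the full iterative algorithm of [74]; the function names are hypothetical, and ties between equally frequent values are broken by whichever value `Counter.most_common` returns first.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical patterns disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(cluster):
    """Frequency-based update: most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


def assign(patterns, modes):
    """Index of the nearest mode for every pattern."""
    return [min(range(len(modes)),
                key=lambda k: matching_dissimilarity(p, modes[k]))
            for p in patterns]


cluster = [("a", "x"), ("a", "y"), ("b", "x")]
mode = cluster_mode(cluster)
nearest = assign([("a", "x"), ("c", "z")], [("a", "x"), ("c", "y")])
```

Alternating `assign` and `cluster_mode` until the assignments stop changing mirrors the C-means iteration, with the matching count playing the role of the squared Euclidean distance.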

Bibliographical notes


Hebbian learning based data clustering using spiking neurons [86] is capable of 
distinguishing between clusters and noisy background data and finds an arbitrary 
number of clusters of arbitrary shape. The clustering ability is more powerful 
than C-means and linkage clustering, and the time complexity of the method 
is also more modest than that of its generally used strongest competitor. The 
robust, locally mediated, self-organizing design makes it a genuinely nonparamet- 
ric approach. The algorithm does not require any information about the number 
or shape of clusters. 

Constrained clustering algorithms incorporate known information about the 
desired data partitions into the clustering process [12]. The must-link and cannot- 
link constraints are two common types of constraints about pairs of objects [12]. 

Information bottleneck [122] is an information-theoretic principle. It can be
motivated from Shannon's rate-distortion theory, which provides lower bounds on
the number of classes into which a source can be divided under a given
distortion constraint. Among all the possible clusterings of a given pattern
set into a fixed number of clusters, information bottleneck clustering
minimizes the loss of mutual information between the patterns and the features
extracted from them.


9.1 For competitive clustering, name and describe a few heuristics to avoid the 
dead-units problem. 


9.2 Implement the RPCL algorithm and apply it for image segmentation on 
an image. 


9.3 Consider the grayscale Lena image of size 512 x 512. Apply image quanti- 
zation for the following two cases: 

(a) The image is divided into 4x4 blocks and the resulting 16,384 16- 
dimensional vectors are the input vector data. 

(b) The image is divided into 8 x 8 blocks and the resulting 4,096
64-dimensional vectors are the input vector data.

Use PSNR to evaluate the reconstructed images after encoding and decoding. 


9.4 Clustering Algorithms’ Referee Package (CARP, http://www.mloss.org) 
is an open source C package for evaluating clustering algorithms. CARP gener- 
ates data sets of different clustering complexity and assesses the performance of 
the concerned algorithm in terms of its ability to classify each data set relative 
to the true grouping. Download CARP and use it to evaluate the performance 
of different clustering algorithms. 


9.5 Randomly generate 20 points in the square $x_1, x_2 \in [2, 8]$.
(a) Create an MST of the weighted graph based on the Euclidean distance.
(b) Remove the inconsistent edges and examine the formed clusters.
(c) Write a program to complete the clustering.


9.6 In Example 9.3, both the Lena and baboon images are compressed by using
LBG. Now obtain the codebook by applying LBG to the Lena image. Then apply
this codebook to the baboon image to get the quantized image. Calculate the
PSNR of the compressed baboon image. Calculate the entropy of the codebook
vector.


9.7 Evaluate three different cluster validity measures on the iris data set by
searching the number of clusters $2 \le C \le \sqrt{N}$.

a) Using FCM. 

b) Using GK clustering. 
10 Radial basis function networks 


10.1 Introduction 


Learning is an approximation problem, which is closely related to the conven- 
tional approximation techniques, such as generalized splines and regularization 
techniques. The RBF network has its origin in performing exact interpolation 
of a set of data points in a multidimensional space [82]. The RBF network is a 
universal approximator, and it is a popular alternative to the MLP, since it has 
a simpler structure and a much faster training process. Both models are widely 
used for classification and function approximation. 

The RBF network has a network architecture similar to that of the classical 
regularization network [81], where the basis functions are Green’s functions of 
the Gram operator associated with the stabilizer. If the stabilizer exhibits radial 
symmetry, the basis functions are radially symmetric as well and hence, an RBF 
network is obtained. From the viewpoint of approximation theory, the regular- 
ization network has three desirable properties [81, 28]: It can approximate any 
multivariate continuous function on a compact domain to an arbitrary accuracy, 
given a sufficient number of units; it has the best-approximation property since 
the unknown coefficients are linear; and the solution is optimal in the sense that 
it minimizes a functional that measures how much it oscillates. 

The RBF network with a localized RBF is a receptive-field or localized net- 
work. The localized approximation method provides the strongest output when 
the input is near the prototype of a node. For a suitably trained localized RBF 
network, similar input vectors always generate similar outputs, while distant 
input vectors produce nearly independent outputs. This is the intrinsic local 
generalization property. A receptive-field network is an associative network in 
that only a small subspace is determined by the input to the network. The 
domain of receptive-field functions is practically a finite real interval defined by 
the parameters of the function. This property is particularly attractive since the 
receptive-field function produces a local effect. Thus, receptive-field networks can 
be conveniently constructed by adjusting the parameters of the receptive-field 
functions and/or adding or removing neurons. 
Figure 10.1 Architecture of the RBF network. 


10.1.1 RBF network architecture 


The RBF network, shown in Fig. 10.1, is a $J_1$-$J_2$-$J_3$ feedforward network.
Each node in the hidden layer uses an RBF $\phi(r)$ as its nonlinear activation
function. $\phi_0(\mathbf{x}) = 1$ corresponds to the bias in the output layer,
while $\phi_i(\mathbf{x}) = \phi(\mathbf{x} - \mathbf{c}_i)$, where $\mathbf{c}_i$
is the center of the $i$th node and $\phi(\mathbf{x})$ is an RBF. The hidden
layer performs a nonlinear transform of the input, and the output layer is a
linear combiner mapping the nonlinearity into a new space. The biases of the
output layer neurons can be modeled by an additional neuron in the hidden
layer, which has a constant activation function $\phi_0(r) = 1$. The RBF network
can achieve a global optimal solution to the adjustable weights in the minimum
MSE sense by using the linear optimization method.

For input pattern $\mathbf{x}$, the output of the network is given by

$$y_i(\mathbf{x}) = \sum_{k=1}^{J_2} w_{ki}\, \phi(\|\mathbf{x} - \mathbf{c}_k\|), \quad i = 1, \ldots, J_3, \tag{10.1}$$

where $y_i(\mathbf{x})$ is the $i$th output of the RBF network, $w_{ki}$ is the
connection weight from the $k$th hidden unit to the $i$th output unit,
$\mathbf{c}_k$ is the prototype or center of the $k$th hidden unit, and
$\|\cdot\|$ denotes the Euclidean norm. $\phi(\cdot)$ is typically selected as
the Gaussian function.

For a set of $N$ pattern pairs $\{(\mathbf{x}_p, \mathbf{y}_p)\}$, (10.1) can be
expressed in matrix form

$$\mathbf{Y} = \mathbf{W}^T \boldsymbol{\Phi}, \tag{10.2}$$

where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_{J_3}]$ is a $J_2 \times J_3$
weight matrix, $\mathbf{w}_i = (w_{1i}, \ldots, w_{J_2 i})^T$,
$\boldsymbol{\Phi} = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_N]$ is a
$J_2 \times N$ matrix, $\boldsymbol{\phi}_p = (\phi_{p,1}, \ldots, \phi_{p,J_2})^T$
is the output of the hidden layer for the $p$th sample,
$\phi_{p,k} = \phi(\|\mathbf{x}_p - \mathbf{c}_k\|)$,
$\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N]$ is a
$J_3 \times N$ matrix, and $\mathbf{y}_p = (y_{p,1}, \ldots, y_{p,J_3})^T$.
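
As a rough illustration of the forward computation, the sketch below evaluates a network output as a weighted sum of RBF activations of the distances to the centers. It assumes a Gaussian RBF with a common width `sigma`; the names `gaussian` and `rbf_forward`, the weight layout `W[k][i]`, and the toy values are choices made here, not part of the text.

```python
import math


def gaussian(r, sigma=1.0):
    """Gaussian RBF of the distance r (assumed form exp(-r^2 / (2 sigma^2)))."""
    return math.exp(-r * r / (2.0 * sigma * sigma))


def rbf_forward(x, centers, W, sigma=1.0):
    """Return [y_1(x), ..., y_J3(x)] with y_i(x) = sum_k W[k][i] * phi(||x - c_k||).

    W[k][i] is the weight from hidden unit k to output unit i.
    """
    phi = [gaussian(math.dist(x, c), sigma) for c in centers]
    n_out = len(W[0])
    return [sum(W[k][i] * phi[k] for k in range(len(centers))) for i in range(n_out)]


# Hypothetical 2-2-2 network.
centers = [(0.0, 0.0), (1.0, 1.0)]
W = [[1.0, 0.0],
     [0.0, 2.0]]
y = rbf_forward((0.0, 0.0), centers, W)
```

Stacking the hidden-layer vectors of all training patterns column by column gives the hidden-output matrix, so the same code also produces the matrix form one sample at a time.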
10.1.2 Universal approximation of RBF networks 


The RBF network has universal approximation and regularization capabilities. 
Theoretically, the RBF network can approximate any continuous function arbi- 
trarily well, if the RBF is suitably chosen [81, 78, 63]. 

Micchelli considered the solution of the interpolation problem
$s(\mathbf{x}_k) = y_k$, $k = 1, \ldots, J_2$, by functions of the form
$s(\mathbf{x}) = \sum_{k=1}^{J_2} w_k \phi(\|\mathbf{x} - \mathbf{x}_k\|^2)$, and
proposed Micchelli's interpolation theorem [69]. $\phi(\cdot)$ is required to be
completely monotonic on $(0, \infty)$, that is, it is continuous on
$(0, \infty)$ and its $l$th-order derivative $\phi^{(l)}(x)$ satisfies
$(-1)^l \phi^{(l)}(x) \ge 0$, $\forall x \in (0, \infty)$ and $l = 0, 1, 2, \ldots$.
A less restrictive condition has been given in [46], where $\phi(\cdot)$ is
continuous on $(0, \infty)$ and its derivatives satisfy
$(-1)^l \phi^{(l)}(x) > 0$, $\forall x \in (0, \infty)$ and $l = 0, 1, 2$.

RBFs possess excellent mathematical properties. In the context of the exact 
interpolation problem, many properties of the interpolating function are rela- 
tively insensitive to the precise form of the nonlinear function ¢(-) [82]. The 
choice of RBF is not crucial to the performance of the RBF network [14]. 

The Gaussian RBF network can approximate, to any degree of accuracy, any
continuous function by a sufficient number of centers $\mathbf{c}_i$,
$i = 1, \ldots, J_2$, and a common standard deviation $\sigma > 0$ in the
$L_p$-norm, $p \in [1, \infty]$ [78]. A class of RBF networks can achieve
universal approximation when the RBF is continuous and integrable [78].

The requirement of the integrability of the RBF is relaxed in [63]. For an 
RBF that is continuous almost everywhere, locally essentially bounded, and not 
a polynomial, the RBF network can approximate any continuous function with 
respect to the uniform norm [63]. From this result, such RBFs as
$\phi(r) = e^{-r/\sigma^2}$ and $\phi(r) = e^{r/\sigma^2}$ also lead to universal
approximation capability [63].

In [37], it is proved in an incremental constructive method that three-layer 
feedforward networks with randomly generated hidden nodes are universal 
approximators, and only the weights linking the hidden and output layers need 
to be adjusted. The proof itself gives an efficient incremental construction of 
the network. Theoretically, the learning algorithms so derived can be applied to 
a wide range of activation functions no matter whether they are sigmoidal or 
nonsigmoidal, continuous or noncontinuous, differentiable or nondifferentiable; 
it can be used to train threshold networks directly. 

A bound on the generalization error for feedforward networks is given by (2.2)
[74]. This bound has been considerably improved to
$O\left(\left(\frac{\ln N}{N}\right)^{1/2}\right)$ in [50] for RBF network
regression with the MSE function.

A decaying RBF $\phi(x)$ is not zero at $x = 0$, but approaches zero as
$x \to \infty$. It is clear that the Gaussian function and wavelet functions
such as the Mexican hat wavelet
$\phi(x) = \frac{2}{\sqrt{3}} \pi^{-1/4} (1 - x^2) e^{-x^2/2}$ are decaying RBFs.
A constructive proof is given in [34] for the fact that a decaying RBF network
with $n + 1$ hidden neurons can interpolate $n + 1$ multivariate samples with
zero error. The given decaying RBFs can uniformly approximate any continuous
multivariate function with arbitrary precision without training [34], thus
giving faster convergence and better generalization performance than the
conventional RBF algorithm, BP, the extreme learning machine, and SVMs.


Figure 10.2 The RBF network for classification: each class is fitted by a kernel function. 


10.1.3 RBF networks and classification 


In a classification problem, we model the posterior probabilities
$P(C_k|\mathbf{x})$ for each of the classes. We have Bayes' theorem

$$P(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k) P(C_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_k) P(C_k)}{\sum_i p(\mathbf{x}|C_i) P(C_i)}, \tag{10.3}$$

where $P(C_k)$ is the prior probability. If we model $p(\mathbf{x}|C_k)$ as an
RBF kernel, we can define a normalized RBF kernel to model Bayes' theorem

$$\phi_k(\mathbf{x}) = \frac{p(\mathbf{x}|C_k)}{\sum_i p(\mathbf{x}|C_i) P(C_i)}. \tag{10.4}$$

Therefore the RBF network bears a natural resemblance to Bayes' theorem. This
is illustrated in Fig. 10.2.

In practice, each class-conditional distribution $p(\mathbf{x}|C_k)$ can be
represented by a mixture of models, as a linear combination of kernel functions.


10.1.4 Learning for RBF networks 


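Like MLP learning, the learning of the RBF network is formulated as the
minimization of the MSE function:

$$E = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{y}_i - \mathbf{W}^T \boldsymbol{\phi}_i\|^2 = \frac{1}{N} \|\mathbf{Y} - \mathbf{W}^T \boldsymbol{\Phi}\|_F^2, \tag{10.5}$$

where $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N]$,
$\mathbf{y}_i$ is the target output for the $i$th sample in the training set,
and $\|\cdot\|_F$ is the Frobenius norm defined as
$\|\mathbf{A}\|_F^2 = \mathrm{tr}(\mathbf{A}^T \mathbf{A})$.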
RBF network learning requires the determination of the RBF centers and 
weights. The selection of the RBF centers is most critical to a successful RBF 
network implementation. The centers can be placed on a random subset or all of 
the training examples, or determined by clustering or via a learning procedure. 
Figure 10.3 Illustration of RBFs. $\sigma = 1$, $\theta = 0$.


For some RBFs such as the Gaussian, it is also necessary to determine the
smoothness parameter $\sigma$. The RBF network using the Gaussian RBF is usually
termed the Gaussian RBF network. Existing learning algorithms are mainly
developed for the Gaussian RBF network, and can be modified accordingly when
other RBFs are used.

The Gaussian RBF network can be regarded as an improved alternative to 
the four-layer probabilistic neural network [93], which is based on the Parzen 
classifier. In a probabilistic neural network, a Gaussian RBF node is placed at 
the position of each training pattern so that the unknown density can be well 
interpolated and approximated. This technique yields optimal decision surfaces 
in a Bayesian sense. Training is to associate each node with its target class. This 
approach, however, severely suffers from the curse of dimensionality and results 
in poor generalization. 


10.2 Radial basis functions 


A number of functions can be used as the RBF [81, 69, 63]

$$\phi(r) = e^{-\frac{r^2}{2\sigma^2}}, \quad \text{Gaussian}, \tag{10.6}$$

$$\phi(r) = r^2 \ln(r), \quad \text{thin-plate spline}, \tag{10.7}$$

$$\phi(r) = \frac{1}{1 + e^{r/\sigma^2 - \theta}}, \quad \text{logistic function}, \tag{10.8}$$

where $r > 0$ denotes the distance from data point $\mathbf{x}$ to center
$\mathbf{c}$, $\sigma$ is used to control the smoothness of the interpolating
function, and $\theta$ in (10.8) is an adjustable bias. These RBFs are
illustrated in Fig. 10.3.
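
The three RBFs can be written as short Python functions. The sketch assumes the forms $e^{-r^2/(2\sigma^2)}$ for the Gaussian, $r^2 \ln r$ for the thin-plate spline, and $1/(1 + e^{r/\sigma^2 - \theta})$ for the logistic function; the function names and default parameters are illustrative choices, and the value of the thin-plate spline at $r = 0$ is set to its limit 0.

```python
import math


def gaussian_rbf(r, sigma=1.0):
    """Gaussian RBF: localized, equal to 1 at r = 0."""
    return math.exp(-r ** 2 / (2.0 * sigma ** 2))


def thin_plate_spline(r):
    """Thin-plate spline r^2 ln(r); r^2 ln(r) -> 0 as r -> 0+, so return 0 there."""
    return 0.0 if r == 0 else r * r * math.log(r)


def logistic_rbf(r, sigma=1.0, theta=0.0):
    """Logistic RBF with smoothness sigma and adjustable bias theta."""
    return 1.0 / (1.0 + math.exp(r / sigma ** 2 - theta))


samples = [(r, gaussian_rbf(r), thin_plate_spline(r), logistic_rbf(r))
           for r in (0.0, 0.5, 1.0, 2.0)]
```

Evaluating `samples` shows the qualitative difference between the functions: the Gaussian and logistic values decay with r, whereas the thin-plate spline is negative for r in (0, 1) and grows without bound afterwards.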

The Gaussian (10.6) and the logistic function (10.8) are localized RBFs with
the property that $\phi(r) \to 0$ as $r \to \infty$. Physiologically, there exist
Gaussian-like receptive fields in cortical cells [81]. As a result, $\phi(r)$ is
typically selected as the Gaussian or other localized RBFs.

The RBF network conventionally uses the Gaussian function (10.6) as the 
RBF. The Gaussian is compact and positive. It is motivated from the point of 
view of kernel regression and kernel density estimation. In fitting data in which 
there is normally distributed noise with the inputs, the Gaussian is the optimal 
basis function in the LS sense [103]. The Gaussian is the only factorizable RBF, 
and this property is desirable for hardware implementation of the RBF network. 

The thin-plate spline function (10.7) is another popular RBF for universal 
approximation. The use of the thin-plate spline is motivated from a curve-fitting 
perspective [64]. It diverges at infinity and is negative over the region of $r \in (0, 1)$.
However, for training purposes, the approximated function needs to be defined 
only over a specified range. There is some empirical evidence to suggest that the 
thin-plate spline better fits the data in high-dimensional settings [64]. 

A pseudo-Gaussian function in one-dimensional space is introduced by selecting
the standard deviation $\sigma$ in the Gaussian (10.6) as two different positive
values, namely, $\sigma_-$ for $x < 0$ and $\sigma_+$ for $x > 0$ [83]. In
$n$-dimensional space, the pseudo-Gaussian function can be defined by

$$\phi(\mathbf{x}) = \prod_{i=1}^{n} \varphi_i(x_i), \tag{10.9}$$

$$\varphi_i(x_i) = \begin{cases} e^{-\frac{(x_i - c_i)^2}{2\sigma_-^2}}, & x_i < c_i \\ e^{-\frac{(x_i - c_i)^2}{2\sigma_+^2}}, & x_i \ge c_i \end{cases}, \tag{10.10}$$

where $\mathbf{c} = (c_1, \ldots, c_n)^T$ is the center vector, and index $i$ runs
over the dimension of the input space $n$. The pseudo-Gaussian function is not
strictly an RBF due to its radial asymmetry. This asymmetry, however, removes
the symmetry restriction and provides the hidden units with greater flexibility
with respect to function approximation.

When utilized to approximate the functional behavior with sharp noncircu- 
lar features, many circular-shaped Gaussian basis functions may be required. In 
order to reduce the size of the RBF network, direction-dependent scaling, shap- 
ing, and rotation of Gaussian RBFs are introduced in [92] for maximal trend sens- 
ing with minimal parameter representations for function approximation. Shaping 
and rotation of the RBFs help in reducing the total number of function units 
required to approximate any given input-output data, while improving accuracy. 


Radial basis functions for approximating constant values 

Approximating functions with constant-valued segments using localized RBFs
is most difficult. If a function has nearly constant values in some intervals,
the Gaussian RBF network is inefficient in approximating these values unless
its variance is very large, approaching infinity. The sigmoidal RBF, as a
composite of a set of sigmoidal functions, can be used to deal with this
problem [54]

$$\phi(x) = \frac{1}{1 + e^{-\beta[(x - c) + \theta]}} - \frac{1}{1 + e^{-\beta[(x - c) - \theta]}}, \tag{10.11}$$

where the bias $\theta > 0$ and the gain $\beta > 0$. $\phi(x)$ is radially
symmetric with the maximum at the center $c$. $\beta$ controls the steepness and
$\theta$ controls the width of the function. The shape of $\phi(x)$ is
approximately rectangular if $\beta\theta$ is large. For large $\beta$ and
$\theta$, it has a soft trapezoidal shape, while for small $\beta$ and $\theta$
it is bell-shaped. $\phi(x)$ can be extended to an $n$-dimensional approximation

$$\phi(\mathbf{x}) = \prod_{i=1}^{n} \varphi_i(x_i), \tag{10.12}$$

$$\varphi_i(x_i) = \frac{1}{1 + e^{-\beta_i[(x_i - c_i) + \theta_i]}} - \frac{1}{1 + e^{-\beta_i[(x_i - c_i) - \theta_i]}}, \tag{10.13}$$

where $\mathbf{x} = (x_1, \ldots, x_n)^T$, $\mathbf{c} = (c_1, \ldots, c_n)^T$,
$\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)^T$, and
$\boldsymbol{\beta} = (\beta_1, \ldots, \beta_n)^T$.
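
To see how the gain and bias shape the one-dimensional sigmoidal RBF, it can be coded as the difference of two shifted logistic sigmoids. The sketch below assumes that form; the function name and the parameter values are illustrative, and very large negative arguments would overflow `math.exp`, which this simple version does not guard against.

```python
import math


def sigmoidal_rbf(x, c=0.0, beta=1.0, theta=1.0):
    """Difference of two logistic sigmoids shifted by +theta and -theta around c."""
    rising = 1.0 / (1.0 + math.exp(-beta * ((x - c) + theta)))
    falling = 1.0 / (1.0 + math.exp(-beta * ((x - c) - theta)))
    return rising - falling


# Large beta * theta: nearly rectangular plateau of half-width theta.
plateau = [sigmoidal_rbf(x, beta=20.0, theta=2.0) for x in (0.0, 1.0, 3.0)]

# Small beta and theta: low, bell-shaped response.
bell_peak = sigmoidal_rbf(0.0, beta=1.0, theta=0.5)
```

With the large gain the function stays close to 1 inside the interval of half-width theta and drops to almost 0 outside it, which is what makes it suitable for constant-valued segments.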
When $\beta_i$ and $\theta_i$ are small, the sigmoidal RBF $\phi(\mathbf{x})$
will be close to zero and the corresponding node will have little contribution
to the approximation task regardless of the tuning of the other parameters
thereafter. To accommodate constant values of the desired output and to avoid
diminishing the kernel functions, $\phi(\mathbf{x})$ can be modified by adding
an additional term to the product term $\varphi_i(x_i)$ [55]

$$\phi(\mathbf{x}) = \prod_{i=1}^{n} \left(\varphi_i(x_i) + \tilde{\varphi}_i(x_i)\right), \tag{10.14}$$

where

$$\tilde{\varphi}_i(x_i) = [1 - \varphi_i(x_i)]\, e^{-a_i (x_i - c_i)^2} \tag{10.15}$$

with $a_i > 0$, and $\tilde{\varphi}_i(x_i)$ being used as a compensating
function to keep the product term from decreasing to zero when $\varphi_i(x_i)$
is small. $\beta_i$ and $a_i$ are, respectively, associated with the steepness
and sharpness of the product term and $\theta_i$ controls the width of the
product term. The parameters are adjusted by the gradient-descent method.

An alternative approach is to use the raised-cosine RBF [89]

$$\phi(x) = \begin{cases} \cos^2\left(\frac{\pi x}{2}\right), & |x| \le 1 \\ 0, & |x| > 1 \end{cases}, \tag{10.16}$$

where $\phi(x)$ is a zero-centered function with compact support since
$\phi(0) = 1$ and $\phi(x) = 0$ for $|x| \ge 1$. The raised-cosine RBF can
represent a constant function exactly using two terms. This RBF can be
generalized to $n$ dimensions [89]

$$\phi(\mathbf{x}) = \prod_{i=1}^{n} \phi(x_i - c_i). \tag{10.17}$$

Notice that $\phi(\mathbf{x})$ is nonzero only when $\mathbf{x}$ is in the
$(-1, 1)^n$ vicinity of $\mathbf{c}$.

Figure 10.4 RBFs for approximating constant-valued segments in the one-dimensional space.
SRBF: sigmoidal RBF (10.11); MSRBF: modified sigmoidal RBF (10.14); RCRBF:
raised-cosine RBF (10.16). The center is selected as $c_i = c = 0$.
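
The claim that two raised-cosine terms represent a constant exactly can be checked numerically. Assuming the raised-cosine RBF is $\cos^2(\pi x/2)$ on $|x| \le 1$ and zero elsewhere, two copies centered at 0 and 1 sum to $\cos^2(\pi x/2) + \sin^2(\pi x/2) = 1$ on $[0, 1]$. The function name and the sample points below are illustrative.

```python
import math


def raised_cosine(x):
    """Raised-cosine RBF: cos^2(pi x / 2) on [-1, 1], zero outside."""
    return math.cos(math.pi * x / 2.0) ** 2 if abs(x) <= 1.0 else 0.0


# Sum of two unit-spaced raised cosines, centered at 0 and at 1, on [0, 1].
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
sums = [raised_cosine(x) + raised_cosine(x - 1.0) for x in grid]
```

Every entry of `sums` equals 1 up to rounding, whereas a pair of Gaussians with finite width can only approximate the constant.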

Figure 10.4 illustrates the sigmoidal, modified sigmoidal and raised-cosine
RBFs with different selections of $\beta$ and $\theta$. In Chapter 21, we will introduce some
popular fuzzy membership functions, which can be used as RBFs by suitably 
constraining some parameters. 


10.3 Learning RBF centers 


RBF network learning is usually implemented using a two-phase strategy. The
first phase specifies and fixes suitable centers $\mathbf{c}_i$ and their
respective RBF parameters $\sigma_i$. For the Gaussian RBF, the $\sigma_i$'s
denote standard deviations, also known as widths or radii. The second phase
adjusts the network weights $\mathbf{W}$. In this section, we describe the
first phase.


Selecting RBF centers randomly from training sets 
A simple method to specify the RBF centers is to randomly select a subset of 
the input patterns from the training set as the RBF centers. If the training 
set is representative of the learning problem, this method is appropriate. This 
method is relatively insensitive to the use of pseudoinverse, hence it may be a 
regularization method. 

The Gaussian RBF network using the same $\sigma$ for all RBF centers has
universal approximation capability [78]. This global width can be selected as
the average of all the Euclidean distances between the $i$th RBF center and its
nearest neighbor

$$\sigma = \langle \|\mathbf{c}_i - \mathbf{c}_j\| \rangle, \tag{10.18}$$

where

$$\|\mathbf{c}_i - \mathbf{c}_j\| = \min_{k \ne i} \|\mathbf{c}_i - \mathbf{c}_k\|. \tag{10.19}$$

One can also select $\sigma$ by $\sigma = d_{\max}/\sqrt{2 J_2}$, where $d_{\max}$
is the maximum distance between the selected centers,
$d_{\max} = \max_{i,k = 1, \ldots, J_2;\, k > i} \|\mathbf{c}_i - \mathbf{c}_k\|$ [9].
This choice makes the Gaussian RBF neither too steep nor too flat.

In practice, the width of each RBF $\sigma_i$, $i = 1, \ldots, J_2$, can be
determined according to the data distribution in the region of the
corresponding RBF center. A heuristic for selecting $\sigma_i$ is to average
the distances between the $i$th RBF center and its $L$ nearest neighbors.
Alternatively, $\sigma_i$ is selected according to the distance of unit $i$ to
its nearest-neighbor unit $j$, $\sigma_i = a \|\mathbf{c}_i - \mathbf{c}_j\|$,
$a \in [1.0, 1.5]$.
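
The width heuristics can be compared on a small set of centers. The sketch below assumes the three rules as read here: the mean nearest-neighbor distance as a global width, the alternative $d_{\max}/\sqrt{2 J_2}$, and the per-unit rule $\sigma_i = a\,\|\mathbf{c}_i - \mathbf{c}_j\|$ with $j$ the nearest neighbor. All function names and the example centers are hypothetical.

```python
import math


def nearest_neighbor_distances(centers):
    """Distance from every center to its nearest other center."""
    return [min(math.dist(c, other) for j, other in enumerate(centers) if j != i)
            for i, c in enumerate(centers)]


def global_width_mean_nn(centers):
    """Global width: average nearest-neighbor distance over all centers."""
    d = nearest_neighbor_distances(centers)
    return sum(d) / len(d)


def global_width_dmax(centers):
    """Global width: maximum pairwise center distance divided by sqrt(2 * J2)."""
    d_max = max(math.dist(a, b)
                for i, a in enumerate(centers) for b in centers[i + 1:])
    return d_max / math.sqrt(2 * len(centers))


def local_widths(centers, a=1.2):
    """Per-unit widths: a times the distance to the nearest neighboring center."""
    return [a * d for d in nearest_neighbor_distances(centers)]


centers = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
sigma_nn = global_width_mean_nn(centers)
sigma_dmax = global_width_dmax(centers)
sigmas = local_widths(centers)
```

For these unevenly spaced centers the per-unit rule assigns a much larger width to the isolated center at (4, 0), which a single global width cannot do.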


Selecting RBF centers by clustering training sets 

Clustering is usually used for determining the RBF centers. The training set 
is grouped into appropriate clusters, whose prototypes are then used as RBF 
centers. The number of clusters can be specified or determined automatically 
depending on the clustering algorithm. 

Unsupervised clustering such as C-means is popular for clustering RBF centers 
[72]. RBF centers determined by supervised clustering are usually more efficient 
for RBF network learning than those determined by unsupervised clustering [13], 
since the distribution of the output patterns in the training set is also considered. 
When the RBF network is trained for classification, LVQ1 is a popular method 
for clustering the RBF centers. 

The relationship between the augmented unsupervised clustering process and 
the MSE of RBF network learning has been investigated in [100]. In the case 
of the Gaussian RBF and any Lipschitz continuous RBF, a weighted MSE for 
supervised quantization yields an upper bound on the MSE of RBF network 
learning. This upper bound and consequently the output error can be made 
arbitrarily small by decreasing the quantization error, which can be accomplished 
by increasing the number of hidden units. 

After the RBF centers are determined, the covariance matrices of the RBFs are
set to the covariances of the input patterns in each cluster. In this case, the
Gaussian RBF network is extended to the generalized RBF network using the
Mahalanobis distance, defined by the weighted norm [81]

$$\phi(\|\mathbf{x} - \mathbf{c}_k\|_A) = e^{-\frac{1}{2} (\mathbf{x} - \mathbf{c}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{c}_k)}, \tag{10.20}$$

where the squared weighted norm
$\|\mathbf{x}\|_A^2 = (\mathbf{A}\mathbf{x})^T (\mathbf{A}\mathbf{x}) = \mathbf{x}^T \mathbf{A}^T \mathbf{A} \mathbf{x}$
and $\boldsymbol{\Sigma}^{-1} = 2 \mathbf{A}^T \mathbf{A}$. When the Euclidean
distance is employed, one can also select the width of the Gaussian RBF network
using the heuristics for selecting RBF centers randomly from training sets.
Learning the weights 


After RBF centers and their widths or covariance matrices are determined, learn- 
ing of the weights W is reduced to a linear optimization problem, which can be 
solved using the LS method or the gradient-descent method. 


Least squares methods for weight learning 


After the parameters related to the RBF centers are determined, the weight
matrix $\mathbf{W}$ is then trained to minimize the MSE (10.5). This LS problem
requires a computational complexity of $O(N J_2^2)$ for $N > J_2$ when popular
orthogonalization techniques such as SVD and QR decomposition are applied. A
simple representation of the solution of the batch LS method is given
explicitly by [9]

$$\mathbf{W} = [\boldsymbol{\Phi}^T]^\dagger \mathbf{Y}^T = (\boldsymbol{\Phi} \boldsymbol{\Phi}^T)^{-1} \boldsymbol{\Phi} \mathbf{Y}^T, \tag{10.21}$$

where $[\cdot]^\dagger$ is the pseudoinverse of the matrix within. The over- or
underdetermined linear LS system is an ill-conditioned problem. SVD is an
efficient and numerically robust technique for dealing with such an
ill-conditioned problem and is preferred.

According to (10.21), if $\boldsymbol{\Phi} \boldsymbol{\Phi}^T = \mathbf{I}$,
the inversion operation is unnecessary. The optimum weight can be computed by

$$\mathbf{w}_k = \boldsymbol{\Phi} \tilde{\mathbf{y}}_k, \quad k = 1, \ldots, J_3, \tag{10.22}$$

where $\tilde{\mathbf{y}}_k = (y_{1,k}, \ldots, y_{N,k})^T$ corresponds to the
$k$th row of $\mathbf{Y}$. Based on this observation, an efficient,
noniterative weight learning technique has been introduced by applying GSO on
RBFs [44]. The RBFs are first transformed into a
set of orthonormal RBFs for which the optimum weights are computed. These 
weights are then recomputed in such a way that their values can be fitted back 
into the original RBF network structure, i.e., with kernel functions unchanged. 
In addition, the method has low storage requirements, and the computation pro- 
cedure can be organized in a parallel manner. Incorporation of new hidden nodes 
aimed at improving the network performance does not require recomputation of 
the network weights already calculated. The contribution of each RBF to the 
overall network output can be evaluated. 

When the full data set is not available and samples are obtained online, the
RLS method can be used to train the weights online

$$\mathbf{w}_i(t) = \mathbf{w}_i(t-1) + \mathbf{K}(t)\, e_i(t), \tag{10.23}$$

$$\mathbf{K}(t) = \frac{\mathbf{P}(t-1) \boldsymbol{\phi}_t}{\boldsymbol{\phi}_t^T \mathbf{P}(t-1) \boldsymbol{\phi}_t + \mu}, \tag{10.24}$$

$$e_i(t) = y_{t,i} - \boldsymbol{\phi}_t^T \mathbf{w}_i(t-1), \quad i = 1, \ldots, J_3, \tag{10.25}$$

$$\mathbf{P}(t) = \frac{1}{\mu} \left[\mathbf{P}(t-1) - \mathbf{K}(t) \boldsymbol{\phi}_t^T \mathbf{P}(t-1)\right], \tag{10.26}$$

where $\mu \in (0, 1]$ is the forgetting factor. Typically,
$\mathbf{P}(0) = a_0 \mathbf{I}_{J_2}$, where $a_0$ is a sufficiently large
number and $\mathbf{I}_{J_2}$ is the $J_2 \times J_2$ identity matrix, and
$\mathbf{w}_i(0)$ is selected as a small random vector.


Figure 10.6 Solve the XOR problem using the RBF network. (a) The input ($x_1$-$x_2$) patterns. (b) The
patterns in the feature ($\phi_1$-$\phi_2$) space.
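
A single-output RLS update can be sketched in plain Python as follows. It assumes the standard RLS recursion: a gain vector, an a priori error, the weight correction w(t) = w(t-1) + K(t) e(t), and the update of P with forgetting factor `mu`. The function name `rls_step`, the zero initial weights, and the toy data (generated from the target 2*phi_1 - phi_2) are assumptions for illustration.

```python
def rls_step(w, P, phi, y, mu=1.0):
    """One RLS update for one output unit.

    w: weights (length J2), P: J2 x J2 matrix as nested lists,
    phi: hidden-layer outputs for the new sample, y: its target.
    """
    n = len(w)
    P_phi = [sum(P[r][c] * phi[c] for c in range(n)) for r in range(n)]
    denom = mu + sum(phi[r] * P_phi[r] for r in range(n))
    K = [v / denom for v in P_phi]                      # gain vector K(t)
    e = y - sum(phi[r] * w[r] for r in range(n))        # error before the update
    w_new = [w[r] + K[r] * e for r in range(n)]
    phi_P = [sum(phi[r] * P[r][c] for r in range(n)) for c in range(n)]
    # P(t) = [P(t-1) - K(t) phi^T P(t-1)] / mu
    P_new = [[(P[r][c] - K[r] * phi_P[c]) / mu for c in range(n)] for r in range(n)]
    return w_new, P_new


# Hypothetical stream of (hidden output, target) pairs with target 2*phi_1 - phi_2.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([0.5, 2.0], -1.0)]
w = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]   # P(0) = a0 * I with a large a0
for phi, y in samples:
    w, P = rls_step(w, P, phi, y)
```

After the four samples the weights are close to (2, -1). For a network with several outputs, the same gain and P can be shared and only the error and weight vector differ per output.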


Example 10.1: We solve the XOR problem using the RBF network. The objective
is to classify the input patterns (1,1) and (0,0) as class 0, and classify (1,0) and
(0,1) as class 1.

We employ a 2-2-1 RBF network, as shown in Fig. 10.5. We set the bias
$b = 0.4$. We select the two points (0,1) and (1,0) as the RBF centers
$\mathbf{c}_i$, $i = 1, 2$, and select the Gaussian RBF
$\phi_i(\mathbf{x}) = e^{-\|\mathbf{x} - \mathbf{c}_i\|^2 / (2\sigma^2)}$ with
$\sigma = 0.5$. Given the input $\mathbf{x}_1 = (0,0)$, $\mathbf{x}_2 = (0,1)$,
$\mathbf{x}_3 = (1,0)$, $\mathbf{x}_4 = (1,1)$, we get their mappings in the
feature space. The input points and their mappings in the feature space are
shown in Fig. 10.6. It is seen that the input patterns are linearly inseparable
in the input space, whereas they are linearly separable in the feature space.

By using (10.21), we can solve the weights as $w_1 = w_2 = 0.3367$. We thus get
the decision boundary in the feature space as
$0.4528\phi_1 + 0.4528\phi_2 - 0.4 = 0$, or
$0.4528\, e^{-2(x_1^2 + (x_2 - 1)^2)} + 0.4528\, e^{-2((x_1 - 1)^2 + x_2^2)} - 0.4 = 0$
in the input space.
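
The feature-space mapping of this example is easy to reproduce. The sketch below assumes the Gaussian RBF $e^{-\|\mathbf{x} - \mathbf{c}_i\|^2/(2\sigma^2)}$ with $\sigma = 0.5$ and centers (0,1) and (1,0); it only computes the hidden-layer outputs and checks separability, and the variable names are illustrative.

```python
import math

centers = [(0.0, 1.0), (1.0, 0.0)]
sigma = 0.5


def features(x):
    """Hidden-layer outputs (phi_1, phi_2) for input pattern x."""
    return [math.exp(-math.dist(x, c) ** 2 / (2.0 * sigma ** 2)) for c in centers]


# XOR patterns and their class labels.
patterns = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
mapped = {x: features(x) for x in patterns}

# phi_1 + phi_2 alone already separates the two classes in feature space.
sums = {x: sum(phi) for x, phi in mapped.items()}
class0 = [sums[x] for x, label in patterns.items() if label == 0]
class1 = [sums[x] for x, label in patterns.items() if label == 1]
```

Both class-0 patterns map to the same point with $\phi_1 = \phi_2 = e^{-2}$, while the class-1 patterns map to points with $\phi_1 + \phi_2 = 1 + e^{-4}$, so any line $\phi_1 + \phi_2 = \text{const}$ between these two values separates the classes.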
RBF network learning using orthogonal least squares 


Optimal subset selection techniques are computationally prohibitive. The OLS 
method [14, 15, 16] is an efficient way for subset model selection. The approach 
chooses and adds RBF centers one by one until an adequate network is con- 
structed. All the training examples are considered as candidates for the centers, 
and the one that reduces the MSE the most is selected as a new hidden unit. 
GSO is first used to construct a set of orthogonal vectors in the space spanned
by the vectors of the hidden unit activation $\boldsymbol{\phi}_p$, and a new
RBF center is then selected by minimizing the residual MSE. Model-selection
criteria are used to determine the size of the network.


Batch orthogonal least squares 


The batch OLS method can not only determine the weights, but also choose 
the number and the positions of the RBF centers. Batch OLS can employ the 
forward [15, 16] and backward [33] center selection approaches. 

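When the RBF centers are distinct, $\boldsymbol{\Phi}^T$ is of full rank. The
orthogonal decomposition of $\boldsymbol{\Phi}^T$ is performed using QR
decomposition

$$\boldsymbol{\Phi}^T = \mathbf{Q} \begin{bmatrix} \mathbf{R} \\ \mathbf{0} \end{bmatrix}, \tag{10.27}$$

where $\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_N]$ is an $N \times N$
orthogonal matrix and $\mathbf{R}$ is a $J_2 \times J_2$ upper triangular
matrix. By minimizing the MSE given by (10.5), one can make use of the
invariant property of the Frobenius norm

$$E = \frac{1}{N} \|\mathbf{Y}\mathbf{Q} - \mathbf{W}^T \boldsymbol{\Phi} \mathbf{Q}\|_F^2 = \frac{1}{N} \|\mathbf{Q}^T \mathbf{Y}^T - \mathbf{Q}^T \boldsymbol{\Phi}^T \mathbf{W}\|_F^2. \tag{10.28}$$

Let

$$\mathbf{Q}^T \mathbf{Y}^T = \begin{bmatrix} \mathbf{B} \\ \tilde{\mathbf{B}} \end{bmatrix}, \tag{10.29}$$

where $\mathbf{B} = [b_{ij}]$ and $\tilde{\mathbf{B}} = [\tilde{b}_{ij}]$ are,
respectively, a $J_2 \times J_3$ and an $(N - J_2) \times J_3$ matrix. We then
have

$$E = \frac{1}{N} \left\| \begin{bmatrix} \mathbf{B} \\ \tilde{\mathbf{B}} \end{bmatrix} - \begin{bmatrix} \mathbf{R} \\ \mathbf{0} \end{bmatrix} \mathbf{W} \right\|_F^2 = \frac{1}{N} \left\| \begin{bmatrix} \mathbf{B} - \mathbf{R}\mathbf{W} \\ \tilde{\mathbf{B}} \end{bmatrix} \right\|_F^2. \tag{10.30}$$

Thus, the optimal $\mathbf{W}$ that minimizes $E$ is derived from

$$\mathbf{R}\mathbf{W} = \mathbf{B}. \tag{10.31}$$

In this case, the residual is

$$E = \frac{1}{N} \|\tilde{\mathbf{B}}\|_F^2. \tag{10.32}$$

This is batch OLS.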

Due to the orthogonalization procedure, it is very convenient to implement 
the forward and backward center-selection approaches. The forward selection 
approach is to build up a network by adding, one at a time, centers at the 
data points that result in the largest decrease in the network output error at 
each stage. The backward selection algorithm is an alternative approach that 
sequentially removes from the network, one at a time, those centers that cause 
the smallest increase in the residual. 

The error reduction ratio due to the $k$th RBF neuron is defined by [16]

$$\mathrm{ERR}_k = \frac{\left(\sum_{i=1}^{J_3} b_{ki}^2\right) \mathbf{q}_k^T \mathbf{q}_k}{\mathrm{tr}(\mathbf{Y}\mathbf{Y}^T)}, \quad k = 1, \ldots, N. \tag{10.33}$$

The error-reduction ratio is a performance-oriented criterion. RBF network
training can proceed in a constructive way, and the centers with the largest
error-reduction ratio values are recruited until

$$1 - \sum_{k=1}^{J_2} \mathrm{ERR}_k < \rho, \tag{10.34}$$

where $\rho \in (0, 1)$ is a tolerance.

An alternative terminating criterion can be based on AIC [16], which balances 
between the performance and the complexity. The weights are determined at the 
same time. The criterion used to stop center selection is a simple threshold on 
the error-reduction ratio. To improve generalization, regularized forward OLS 
methods can be implemented by penalizing large weights [76]. 

The computational complexity of the orthogonal decomposition of an information
matrix $\boldsymbol{\Phi}^T$ is $O(N J_2^2)$. When the size of a training data
set $N$ is large, batch OLS is computationally demanding and also needs a large
amount of computer memory.

The RBF center clustering method based on the Fisher ratio class separability 
measure [66] is similar to the forward selection OLS algorithm [15, 16]. 


Recursive orthogonal least squares 


Recursive OLS algorithms have been proposed for updating the weights of single- 
input single-output [6] and multi-input multi-output systems [114, 29]. 

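At iteration $t - 1$, following a procedure similar to that for batch OLS, and
applying QR decomposition, we have [29]

$$\boldsymbol{\Phi}^T(t-1) = \mathbf{Q}(t-1) \begin{bmatrix} \mathbf{R}(t-1) \\ \mathbf{0} \end{bmatrix}, \tag{10.35}$$

$$\mathbf{Q}^T(t-1) \left(\mathbf{Y}(t-1)\right)^T = \begin{bmatrix} \mathbf{B}(t-1) \\ \tilde{\mathbf{B}}(t-1) \end{bmatrix}. \tag{10.36}$$

The sizes of the matrices $\boldsymbol{\Phi}$ and $\mathbf{Y}$ increase with
each new data sample. Then at iteration $t$, the updates for $\mathbf{R}(t-1)$
and $\mathbf{B}(t-1)$ are calculated using another QR decomposition

$$\begin{bmatrix} \mathbf{R}(t-1) \\ \boldsymbol{\phi}^T(t) \end{bmatrix} = \mathbf{Q}_1(t) \begin{bmatrix} \mathbf{R}(t) \\ \mathbf{0} \end{bmatrix}, \tag{10.37}$$

$$\begin{bmatrix} \mathbf{B}(t) \\ \tilde{\mathbf{b}}^T(t) \end{bmatrix} = \mathbf{Q}_1^T(t) \begin{bmatrix} \mathbf{B}(t-1) \\ \mathbf{y}^T(t) \end{bmatrix}. \tag{10.38}$$

The minimization of $E(t)$ leads to the optimal $\mathbf{W}(t)$, which is solved
by

$$\mathbf{R}(t) \mathbf{W}(t) = \mathbf{B}(t). \tag{10.39}$$

Since $\mathbf{R}(t)$ is an upper triangular matrix, $\mathbf{W}(t)$ can be
easily solved by backward substitution. Update the residual at iteration $t$
using the recursive equation

$$E(t) = \frac{1}{t} \|\tilde{\mathbf{B}}(t)\|_F^2 = \frac{1}{t} \left(\|\tilde{\mathbf{b}}(t)\|^2 + \|\tilde{\mathbf{B}}(t-1)\|_F^2\right) = \frac{t-1}{t} E(t-1) + \frac{1}{t} \|\tilde{\mathbf{b}}(t)\|^2. \tag{10.40}$$

Initial values can be selected as $\mathbf{R}(0) = \alpha \mathbf{I}$, where
$\alpha$ is a small positive number such as 0.01, $\mathbf{B}(0) = \mathbf{0}$,
and $\|\tilde{\mathbf{B}}(0)\|_F^2 = 0$. In offline training, the weights need
only be computed once at the end of training, since their values do not affect
the recursive updates in (10.37) and (10.38).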

After training with recursive OLS, the final triangular system of (10.39), with 
t = N, contains important information about the learned network, and can be 
used to sequentially select the centers to minimize the network output error. 
Forward and backward center selection methods are developed from this infor- 
mation, and Akaike’s final prediction error criterion is used in model selection 
[29]. 


Supervised learning of all parameters 
The preceding methods for selecting the network parameters are practical, but
by no means optimal. The gradient-descent method is the simplest method for
finding the minimum value of $E$. In this section, we apply gradient-descent
based supervised methods to RBF network learning.
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Supervised learning for general RBF networks 


To derive the supervised learning algorithm for the general RBF network, we 
rewrite the error function (10.5) as 


$$E = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{J_3} \left(e_{n,i}\right)^2, \qquad (10.41)$$


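where $e_{n,i}$ is the approximation error at the ith output node for the nth example,

$$e_{n,i} = y_{n,i} - \sum_{m=1}^{J_2} w_{mi} \phi\left(\|\mathbf{x}_n - \mathbf{c}_m\|\right) = y_{n,i} - \mathbf{w}_i^T \boldsymbol{\phi}_n. \qquad (10.42)$$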


Taking the derivative of E with respect to Wmi and Cm, respectively, we have 


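$$\frac{\partial E}{\partial w_{mi}} = -\frac{2}{N} \sum_{n=1}^{N} e_{n,i} \phi\left(\|\mathbf{x}_n - \mathbf{c}_m\|\right), \quad m = 1, \ldots, J_2, \; i = 1, \ldots, J_3, \qquad (10.43)$$

$$\frac{\partial E}{\partial \mathbf{c}_m} = \frac{2}{N} \sum_{i=1}^{J_3} w_{mi} \sum_{n=1}^{N} e_{n,i} \dot{\phi}\left(\|\mathbf{x}_n - \mathbf{c}_m\|\right) \frac{\mathbf{x}_n - \mathbf{c}_m}{\|\mathbf{x}_n - \mathbf{c}_m\|}, \quad m = 1, \ldots, J_2, \qquad (10.44)$$

where $\dot{\phi}(\cdot)$ is the first derivative of $\phi(\cdot)$.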
The gradient-descent method is defined by the update equations 


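
$$\Delta w_{mi} = -\eta_1 \frac{\partial E}{\partial w_{mi}}, \qquad (10.45)$$

$$\Delta \mathbf{c}_m = -\eta_2 \frac{\partial E}{\partial \mathbf{c}_m}, \qquad (10.46)$$

where $\eta_1$ and $\eta_2$ are learning rates.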

To prevent two or more centers from becoming too close to, or coinciding
with, one another during the learning process, one can add a term such as
$\sum_{\alpha \neq \beta} \psi\left(\|\mathbf{c}_\alpha - \mathbf{c}_\beta\|\right)$ to E, where $\psi(\cdot)$ is an appropriate repulsive potential. The
gradient-descent method given by (10.45) and (10.46) is modified accordingly.

A simple strategy for initialization is to select the RBF centers based on a 
random subset of the examples and the weights W as a matrix with small random 
components. To accelerate the search process, one can use clustering to find 
the initial RBF centers and LS to find the initial weights, and then apply the 
gradient-descent procedure to refine the learning result. 
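
A minimal NumPy sketch of the batch gradients (10.43)-(10.44) and one update step (10.45)-(10.46) is given below. The basis function φ(r) = exp(−r²), the function names, the default learning rates, and the array layout are assumptions chosen for illustration; any differentiable φ could be substituted.

```python
import numpy as np

def phi(r):
    return np.exp(-r ** 2)                  # assumed basis function

def dphi(r):
    return -2.0 * r * np.exp(-r ** 2)       # its first derivative

def error_and_grads(X, Y, C, W):
    """X: N x J1 inputs, Y: N x J3 targets, C: J2 x J1 centers, W: J2 x J3."""
    N = X.shape[0]
    D = X[:, None, :] - C[None, :, :]       # x_n - c_m, shape N x J2 x J1
    r = np.linalg.norm(D, axis=2)           # ||x_n - c_m||, shape N x J2
    err = Y - phi(r) @ W                    # e_{n,i}, cf. (10.42)
    E = np.sum(err ** 2) / N                # cf. (10.41)
    gW = -(2.0 / N) * phi(r).T @ err        # cf. (10.43)
    # sum_i w_{mi} e_{n,i} * dphi / ||x_n - c_m||, guarded against r = 0
    s = (err @ W.T) * dphi(r) / np.maximum(r, 1e-12)
    gC = (2.0 / N) * np.einsum("nm,nmj->mj", s, D)   # cf. (10.44)
    return E, gW, gC

def gd_step(X, Y, C, W, eta1=0.05, eta2=0.05):
    """One batch update of centers and weights, cf. (10.45) and (10.46)."""
    _, gW, gC = error_and_grads(X, Y, C, W)
    return C - eta2 * gC, W - eta1 * gW
```

In practice `gd_step` would be iterated from centers found by clustering and weights found by LS, as suggested above, until E stops decreasing.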

Setting the gradients to zero, the optimal solutions to the weights and centers
can be derived. The gradient-descent procedure is an iterative approximation to
the optimal solutions. If we set $e_{n,i} = 0$ for every sample n, the right-hand
side of (10.43) is zero; we then achieve the global optimum and get

$$\mathbf{y}_n = \mathbf{W}^T \boldsymbol{\phi}_n. \qquad (10.47)$$

For all samples,

$$\mathbf{Y} = \mathbf{W}^T \mathbf{\Phi}. \qquad (10.48)$$


This is exactly the same set of linear equations as (10.2). The optimum solution 
to weights is given by (10.21). Equating (10.44) to zero leads to 


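$$\mathbf{c}_m = \frac{\sum_{i=1}^{J_3} w_{mi} \sum_{n=1}^{N} \frac{e_{n,i} \dot{\phi}_{mn}}{\|\mathbf{x}_n - \mathbf{c}_m\|} \mathbf{x}_n}{\sum_{i=1}^{J_3} w_{mi} \sum_{n=1}^{N} \frac{e_{n,i} \dot{\phi}_{mn}}{\|\mathbf{x}_n - \mathbf{c}_m\|}}, \qquad (10.49)$$

where $\dot{\phi}_{mn} = \dot{\phi}\left(\|\mathbf{x}_n - \mathbf{c}_m\|\right)$. Thus, the optimal centers are weighted sums of the
data points, corresponding to a task-dependent clustering problem.
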





10.6.2 Supervised learning for Gaussian RBF networks 


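For the Gaussian RBF network, the RBF at each center can be assigned a
different width $\sigma_i$,

$$\phi_i(\mathbf{x}) = e^{-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}}. \qquad (10.50)$$

The RBFs can be further generalized:

$$\phi_i(\mathbf{x}) = e^{-\frac{1}{2} (\mathbf{x} - \mathbf{c}_i)^T \mathbf{\Sigma}_i^{-1} (\mathbf{x} - \mathbf{c}_i)}, \qquad (10.51)$$

where $\mathbf{\Sigma}_i \in R^{J_1 \times J_1}$ is a positive-definite, symmetric covariance matrix. When
$\mathbf{\Sigma}_i^{-1}$ is in general form, the shape and orientation of the axes of the hyperellipsoid
are arbitrary in the feature space.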

If $\mathbf{\Sigma}_i$ is a diagonal matrix with nonconstant diagonal elements, $\mathbf{\Sigma}_i^{-1}$ is
completely defined by a vector $\boldsymbol{\sigma}_i \in R^{J_1}$, and each $\phi_i$ is a hyperellipsoid whose axes
are along the axes of the feature space,

$$\phi_i(\mathbf{x}) = e^{-\sum_{j=1}^{J_1} \frac{(x_j - c_{i,j})^2}{2\sigma_{i,j}^2}}. \qquad (10.52)$$

For the $J_1$-dimensional input space, each RBF of the form (10.51) has a total
of $J_1(J_1 + 3)/2$ independent adjustable parameters, while each RBF of the form
(10.50) and each RBF of the form (10.52) have $J_1 + 1$ and $2J_1$ independent
parameters, respectively. There is a tradeoff between using a small network with
many adjustable parameters and using a large network with fewer adjustable
parameters.

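When using the RBF of the form (10.50), we get the gradients as

$$\frac{\partial E}{\partial \mathbf{c}_m} = -\frac{2}{N} \sum_{n=1}^{N} \phi_m(\mathbf{x}_n) \frac{\mathbf{x}_n - \mathbf{c}_m}{\sigma_m^2} \sum_{i=1}^{J_3} e_{n,i} w_{mi}, \qquad (10.53)$$

$$\frac{\partial E}{\partial \sigma_m} = -\frac{2}{N} \sum_{n=1}^{N} \phi_m(\mathbf{x}_n) \frac{\|\mathbf{x}_n - \mathbf{c}_m\|^2}{\sigma_m^3} \sum_{i=1}^{J_3} e_{n,i} w_{mi}. \qquad (10.54)$$

Similarly, for the RBF of the form (10.52), the gradients are given by

$$\frac{\partial E}{\partial c_{m,j}} = -\frac{2}{N} \sum_{n=1}^{N} \phi_m(\mathbf{x}_n) \frac{x_{n,j} - c_{m,j}}{\sigma_{m,j}^2} \sum_{i=1}^{J_3} e_{n,i} w_{mi}, \qquad (10.55)$$

$$\frac{\partial E}{\partial \sigma_{m,j}} = -\frac{2}{N} \sum_{n=1}^{N} \phi_m(\mathbf{x}_n) \frac{(x_{n,j} - c_{m,j})^2}{\sigma_{m,j}^3} \sum_{i=1}^{J_3} e_{n,i} w_{mi}. \qquad (10.56)$$
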


Adaptations for $\mathbf{c}_i$ and $\mathbf{\Sigma}_i$ are along the negative gradient directions. The weights
W are updated by (10.43) and (10.45). To prevent unreasonable radii, the
updating algorithms can also be derived by adding to E a constraint term that
penalizes small radii, $E_c = \sum_i \frac{1}{\sigma_i}$ or $E_c = \sum_{i,j} \frac{1}{\sigma_{i,j}}$.
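
The gradients for axis-aligned Gaussian units can be sketched as follows, assuming units of the form exp(−Σ_j (x_j − c_{m,j})² / (2σ_{m,j}²)) with one width per unit and per input dimension. The function names `forward` and `grads` and the array layout (widths stored as a J2 x J1 array `S`) are assumptions for illustration, not part of the original algorithm description.

```python
import numpy as np

def forward(X, C, S, W):
    """X: N x J1, C: J2 x J1 centers, S: J2 x J1 widths, W: J2 x J3 weights."""
    D = X[:, None, :] - C[None, :, :]                    # x_n - c_m, N x J2 x J1
    Phi = np.exp(-0.5 * np.sum((D / S[None]) ** 2, axis=2))
    return D, Phi, Phi @ W                               # differences, unit outputs, network outputs

def grads(X, Y, C, S, W):
    """Gradients of E = (1/N) sum e^2 w.r.t. centers and widths, cf. (10.55), (10.56)."""
    N = X.shape[0]
    D, Phi, out = forward(X, C, S, W)
    err = Y - out                                        # e_{n,i}
    g = Phi * (err @ W.T)                                # phi_m(x_n) * sum_i e_{n,i} w_{mi}
    gC = -(2.0 / N) * np.einsum("nm,nmj->mj", g, D / S[None] ** 2)
    gS = -(2.0 / N) * np.einsum("nm,nmj->mj", g, D ** 2 / S[None] ** 3)
    return gC, gS
```

Centers and widths are then moved along `-gC` and `-gS` with suitably small learning rates; a finite-difference check on a few entries is a cheap way to validate such hand-derived gradients.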

In [111], the improved Levenberg-Marquardt algorithm [110] is applied for
training RBF networks to adjust all the parameters. The proposed improved
second-order algorithm can normally reach a smaller training/testing error with
far fewer RBF units. During the computation, the quasi-Hessian matrix and the
gradient vector are accumulated as the sums of related submatrices and vectors,
respectively. Only one Jacobian row is stored and used for multiplication,
instead of storing and multiplying the entire Jacobian matrix.


Discussion on supervised learning 


The gradient-descent algorithms introduced thus far are batch learning
algorithms. As discussed in Chapter 4, by dropping $\frac{1}{N} \sum_{n=1}^{N}$ in the error function
E and accordingly in the algorithms, one can update the parameters at each
example $(\mathbf{x}_p, \mathbf{y}_p)$. This yields incremental learning algorithms, which are
typically much faster than their batch counterparts for suitably selected
learning parameters.

Although the RBF network trained by the gradient-descent method is capable
of providing equivalent or better performance compared to that of the MLP
trained with BP, the training times for the two methods are comparable [106].
The gradient-descent method is slow in convergence since it cannot efficiently
use the locally tuned representation of the hidden-layer units. When the
hidden-unit receptive fields, controlled by the widths $\sigma_i$, are narrow, for a given input
only a few of the total number of hidden units will be activated and hence only
these units need to be updated. However, in the gradient-descent method, there
is no limitation on $\sigma_i$, thus there is no guarantee that the RBF network remains
localized after supervised learning [72]. As a result, the computational advantage
of locality is not utilized.

The gradient-descent method is prone to finding local minima of the error
function. For reasonably well-localized RBFs, an input will generate a significant
activation in a small region, and the chance of getting stuck at a local
minimum is small. Unsupervised methods can be used to determine $\sigma_i$. Unsupervised
learning is used to initialize the network parameters, and supervised learning is
usually used for the fine-tuning of the network parameters. The ultimate RBF
network learning is typically a blend of unsupervised and supervised algorithms.


Extreme learning machines 


Extreme learning machine (ELM) [38] is a learning algorithm for generalized 
single-hidden-layer feedforward networks. The hidden-layer node parameters are 
selected randomly and the output weights are determined analytically. It tends 
to provide good generalization performance at extremely fast learning speed. It 
can learn thousands of times faster than gradient-based learning algorithms and 
SVM, and tends to achieve similar or better generalization performance. ELM 
can be used to train single-hidden-layer feedforward networks with many non- 
differentiable activation functions [39]. A similar idea for ELM has been imple- 
mented in the no-propagation (no-prop) algorithm for training the MLP [109]. 
The no-prop algorithm is comparable to BP in terms of the training and gen- 
eralization performance, but is much simpler to implement, and also converges 
much faster. 
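
The basic idea (random hidden-layer parameters, analytically determined output weights) can be sketched as below. This is a minimal illustration, not the reference implementation of [38]: the tanh activation, the standard-normal sampling of hidden parameters, and the names `elm_fit` and `elm_predict` are assumptions.

```python
import numpy as np

def elm_fit(X, T, n_hidden, rng):
    """X: N x d inputs, T: N x k targets. Returns random hidden parameters and output weights."""
    A = rng.normal(size=(X.shape[1], n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)                 # random hidden biases, never trained
    H = np.tanh(X @ A + b)                        # hidden-layer output matrix
    # Output weights: least-squares solution of H beta = T (minimum norm if rank-deficient)
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)
    return A, b, beta

def elm_predict(X, A, b, beta):
    return np.tanh(X @ A + b) @ beta
```

For example, `elm_fit(X, T, 30, np.random.default_rng(0))` fits a 30-node network in a single linear solve; the only quantity chosen by hand is the number of hidden nodes.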

The universal approximation capability of ELM has been proved in an incre- 
mental ELM method [37]. ELM with any bounded nonlinear piecewise contin- 
uous activation can function as a universal approximator. It is shown that the 
VC dimension of ELM is equal to the number of hidden nodes of ELM with 
probability one. 

Online sequential ELM [62] can learn data one-by-one or chunk-by-chunk 
with fixed or varying chunk size. The parameters of hidden nodes are randomly 
selected and the output weights are analytically determined based on the sequen- 
tially arriving data. Only the number of hidden nodes is manually chosen. 
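
Chunk-by-chunk learning of the output weights can be sketched with a recursive least-squares update. The sketch below is written in the spirit of online sequential ELM but is not taken from [62]: the ridge term `lam` in the initialization (used here so that the initial chunk need not have full column rank), the tanh hidden layer, and all function names are assumptions.

```python
import numpy as np

def hidden(X, A, b):
    """Hidden-layer outputs for fixed random parameters A, b (assumed tanh units)."""
    return np.tanh(X @ A + b)

def oselm_init(H0, T0, lam=1e-2):
    """Initial inverse-correlation matrix P and output weights from the first chunk."""
    P = np.linalg.inv(H0.T @ H0 + lam * np.eye(H0.shape[1]))
    return P, P @ H0.T @ T0

def oselm_update(P, beta, H, T):
    """Fold in a new chunk (H, T) of any size without revisiting old data."""
    K = np.linalg.solve(np.eye(H.shape[0]) + H @ P @ H.T, H @ P)
    P = P - P @ H.T @ K                      # matrix-inversion-lemma update of P
    beta = beta + P @ H.T @ (T - H @ beta)   # correct beta by the new chunk's residual
    return P, beta
```

After all chunks have been processed, `beta` equals the batch ridge solution computed on the full data set with the same `lam`, so the chunk sizes can vary freely.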

ELM can be used to train neural networks with threshold functions directly 
instead of approximating them with sigmoidal functions for the ease of hardware 
implementation [39]. Also, ELM does not need manually tuned parameters and is 
much faster. Pruning of neurons in a network built using ELM has been proposed 
in [84] for classification purposes. 

The optimally pruned ELM [70] adds steps to make ELM more robust and 
generic. It uses a combination of three different types of kernels, namely linear, 
sigmoid and Gaussian kernels, whereas the original ELM uses only sigmoid ker- 
nels. It uses a leave-one-out criterion for the selection of an appropriate number 
of neurons. The algorithm performs several orders of magnitude faster than MLP, 
SVM and Gaussian process, but maintains an accuracy that is comparable to 
that of SVM. 

A two-stage ELM algorithm [51] first applies a forward recursive algorithm to 
select the hidden nodes from the candidates randomly generated at each step 
and then adds nodes to the network until the stopping criterion achieves its 
minimum. In the second stage, the insignificant hidden nodes are removed from 
the network, thus drastically reducing the network complexity. An ELM regressor is
designed by applying a constructive method [52]. It selects the optimal number
of hidden nodes by an unbiased risk estimation-based criterion.


Various learning methods 


All general-purpose unconstrained optimization methods, including those intro- 
duced for the MLP, are applicable for RBF network learning, since the RBF 
network is a special feedforward network and RBF network learning is an uncon- 
strained optimization problem. These include popular second-order approaches 
like LM, CG, BFGS and EKF, and heuristic global optimization methods. 

The LM method is used for RBF network learning [30, 67, 79]. In [67, 79], the 
LM method is used for estimating nonlinear parameters, and the SVD-based LS 
method is used for linear weight estimation at each iteration. In [79], at each 
iteration the weights are updated many times during the process of looking for 
the search direction to update the nonlinear parameters. 

EKF can be used for RBF network learning [91]. After the number of centers 
is chosen, EKF simultaneously solves for the prototype vectors and the weight 
matrix. A decoupled EKF further decreases the computational complexity of the 
training algorithm [91]. In [22], a pair of parallel running extended Kalman filters 
are used to sequentially update both the output weights and the RBF centers. 

In [46], the RBF network is reformulated by using RBFs formed in terms of
admissible generator functions, and a fully supervised gradient-descent
training method is provided. LP models with polynomial time complexity are also employed
to train the RBF network [86]. In [31], a multiplication-free Gaussian RBF net- 
work with a gradient-based nonlinear learning algorithm is described for adaptive 
function approximation. 

The learning algorithm for training cosine RBF networks given in [47] trains 
reformulated RBF networks by updating selected adjustable parameters to min- 
imize the class-conditional variances at the outputs of their RBFs. Cosine RBF 
networks trained by such a learning algorithm are capable of identifying uncer- 
tainty in data classification. The classification accuracy of cosine RBF networks 
is also improved by rejecting ambiguous feature vectors based on their responses. 

The RBF network using regression weights can significantly reduce the number 
of hidden units, and is effectively used for approximating nonlinear dynamic 
systems [53, 89, 83]. For a $J_1$-$J_2$-1 RBF network, the linear regression weights
are defined by [53]

$$w_i = \mathbf{a}_i^T \tilde{\mathbf{x}} + \xi_i, \quad i = 1, \ldots, J_2, \qquad (10.57)$$

where $w_i$ is the weight from the ith hidden unit to the output unit,
$\mathbf{a}_i = (a_{i,0}, a_{i,1}, \ldots, a_{i,J_1})^T$ is the regression parameter vector, $\tilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$ is the
augmented input vector, and $\xi_i$ is a zero-mean Gaussian white-noise process. For the
Gaussian RBF network, the RBF centers $\mathbf{c}_i$ and their widths $\sigma_i$ can be selected
by C-means and the nearest-neighbor heuristic [72], while the parameters of the
regression weights are estimated by the EM method [53]. In [89], a simple but fast
computational procedure is achieved by using a high-dimensional raised-cosine 
RBF. Storage space is also reduced by allowing the RBF centers to be situated 
at a nonuniform grid of points. 

Some regularization techniques for improving the generalization capability of 
the MLP and the RBF network have been discussed in [68]. As in the MLP, the
favored quadratic weight penalty term $\sum w_{ij}^2$ is also appropriate for the RBF
network. The widths of the RBFs are widely known to be a major source of
ill-conditioning in RBF network training, and large width parameters are desirable
for better generalization. Some suitable penalty terms for widths are given in 
[68]. 

Hyper basis function (HyperBF) networks [81] are generalized RBF networks 
with a radial function of a Mahalanobis-like distance. HyperBF networks can be 
constructed with three learning phases [90], where regular two phase methods are 
used to initialize the network and in the third phase, means, scaling factors, and 
weights are estimated simultaneously by gradient descent and backpropagation 
using a single variable learning factor for all parameters that are estimated adap- 
tively. In a regularization method that performs soft local dimension reduction in 
addition to weight decay [65], hierarchical clustering is used to initialize neurons 
followed by a multiple-step-size gradient optimization using a scaled version of 
Rprop with a localized partial backtracking step. The training provides faster 
and smoother convergence than regular Rprop. 

The probabilistic RBF network [98] constitutes a probabilistic version of the 
RBF network for classification that extends the typical mixture model approach 
to classification by allowing the sharing of mixture components among all classes. 
It is an alternative approach for class conditional density estimation. A typical 
learning method employs the EM algorithm and depends strongly on the initial 
parameter values. In [23], a technique for incremental training of the probabilis- 
tic RBF network for classification is proposed, based on criteria for detecting a 
region that is crucial for the classification task. After the addition of all compo- 
nents, the algorithm splits every component of the network into subcomponents, 
each corresponding to a different class. 

When a training set contains outliers, robust statistics can be applied for 
robust learning of the RBF network. Robust learning algorithms are usually 
derived from the M-estimator method [88, 21]. The robust RBF network learn- 
ing algorithm [88] is based on Hampel’s tanh-estimator function. The network 
architecture is initialized by using the conventional SVD-based learning method. 
The robust part of the learning method is implemented iteratively using the 
CG method. The annealing robust RBF network [21] improves the robustness of 
the RBF network against outliers for function approximation by using the M- 
estimator and the annealing robust learning algorithm [20]. The median RBF 
algorithm [8] is based on robust parameter estimation of the RBF centers, and 
employs the Mahalanobis distance.


Normalized RBF networks


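The normalized RBF network is defined by normalizing the vector composed of
the responses of all the RBF units [72]

$$y_i(\mathbf{x}) = \sum_{k=1}^{J_2} w_{ki} \bar{\phi}_k(\mathbf{x}), \quad i = 1, \ldots, J_3, \qquad (10.58)$$

where

$$\bar{\phi}_k(\mathbf{x}) = \frac{\phi(\mathbf{x} - \mathbf{c}_k)}{\sum_{j=1}^{J_2} \phi(\mathbf{x} - \mathbf{c}_j)}. \qquad (10.59)$$
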
The normalization operation is nonlocal, since each hidden node is required to 
know about the outputs of other hidden nodes. Hence, the convergence process 
is computationally costly. 

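The normalized RBF network given by (10.58) can be presented in another
form [11, 53]. The network output is defined by

$$y_i(\mathbf{x}) = \frac{\sum_{j=1}^{J_2} w_{ji} \phi(\mathbf{x} - \mathbf{c}_j)}{\sum_{j=1}^{J_2} \phi(\mathbf{x} - \mathbf{c}_j)}. \qquad (10.60)$$
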
Now, normalization is performed in the output layer. As it already receives infor- 
mation from all the hidden units, the locality of the computational processes is 
preserved. The two forms of the normalized RBF network, (10.58) and (10.60), 
are equivalent. 
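
The equivalence of the two forms is easy to check numerically. The sketch below assumes Gaussian units with one width per unit; the function names and array layout are illustrative assumptions.

```python
import numpy as np

def gaussian_units(X, C, sigma):
    """X: N x J1 inputs, C: J2 x J1 centers, sigma: J2 widths. Returns N x J2 responses."""
    d2 = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nrbf_hidden_normalized(X, C, sigma, W):
    """Normalize the hidden responses first, then apply the weights, cf. (10.58)-(10.59)."""
    Phi = gaussian_units(X, C, sigma)
    Phi_bar = Phi / Phi.sum(axis=1, keepdims=True)
    return Phi_bar @ W

def nrbf_output_normalized(X, C, sigma, W):
    """Apply the weights first, then normalize at the output layer, cf. (10.60)."""
    Phi = gaussian_units(X, C, sigma)
    return (Phi @ W) / Phi.sum(axis=1, keepdims=True)
```

Both functions return the same outputs up to rounding. Because the normalized responses sum to one, setting all weights of an output node to the same constant yields that constant for every input.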

In the normalized RBF network of the form (10.60), the traditional roles of the 
weights and activities in the hidden layer are exchanged. In the RBF network, 
the weights determine how much each hidden node contributes to the
output, while in the normalized RBF network, the activities of the hidden nodes 
determine which weights contribute most to the output. The normalized RBF 
network provides better smoothness than the RBF network does. Due to the 
localized property of the receptive fields, for most data points, there is usually 
only one hidden node that contributes significantly to (10.60). The normalized 
RBF network (10.60) can be trained using a procedure similar to that for the 
RBF network. The normalized Gaussian RBF network exhibits superiority in 
supervised classification due to its soft modification rule. It is also a universal 
approximator in the space of continuous functions with compact support in the 
space $L^2(R^p, d\mathbf{x})$ [4].

The normalized RBF network is an RBF network with a quasilinear activation 
function with a squashing coefficient decided by the activations of all the hidden
units. The normalized RBF network loses the localized characteristics of the 
localized RBF network and exhibits excellent generalization properties. Thus, it 
softens the curse of dimensionality associated with localized RBF networks [11]. 
The normalized Gaussian RBF network outperforms the Gaussian RBF network
in terms of training and generalization errors, and exhibits a more uniform error
over the training domain. In addition, the normalized Gaussian RBF network is
not sensitive to the RBF widths.


Optimizing network structure 


Optimizing the structure of an RBF network means determining the number and
locations of the RBF centers automatically by using constructive and pruning
methods.


Constructive methods 


The constructive approach gradually increases the number of RBF centers until
a criterion is satisfied. In [45], a new prototype is created in a region of the input
space by splitting an existing prototype $\mathbf{c}_j$ selected by a splitting criterion, and
splitting is performed by adding the perturbation vectors $\pm\boldsymbol{\epsilon}_j$ to $\mathbf{c}_j$. The resulting
vectors $\mathbf{c}_j \pm \boldsymbol{\epsilon}_j$ together with the existing centers form the initial set of centers for
the next growing cycle. Existing algorithms for updating the centers $\mathbf{c}_j$, widths
$\sigma_j$, and weights can be used. The process continues until a stopping criterion
is satisfied. In a heuristic incremental algorithm [26], the training phase adds a
hidden node $\mathbf{c}_t$ at each epoch t by an error-driven rule.

The incremental RBF network architecture using hierarchical gridding of the 
input space [7] allows for a uniform approximation without wasting resources. 
Additional layers of Gaussians at lower scales are added where the residual error 
is higher. The method shows a high accuracy in the reconstruction, and it can 
deal with nonevenly spaced data points and is fully parallelizable. 

The hierarchical RBF network [27] is a multiscale version of the RBF network.
It is constituted by hierarchical layers, each containing a Gaussian grid at a 
decreasing scale. The grids are not completely filled, but units are inserted only 
where the local error is over a threshold. The constructive approach is based 
only on the local operations, which do not require any iteration on the data. It 
allows for an effective network to be built in a very short time. The coarse-to-fine 
approach enables the hierarchical RBF network to grow until the reconstructed 
surface meets the required quality. 

The forward OLS algorithm [14] is a well-known constructive algorithm. Based 
on the OLS algorithm, a constructive algorithm for the generalized Gaussian 
RBF network is given in [102]. RBF network learning based on a modification 
to the cascade-correlation algorithm works in a way similar to the OLS method, 
but with a significantly faster convergence [57]. 

The dynamic decay adjustment algorithm is a fast constructive training 
method for the RBF network when used for classification [5]. It has independent
adjustment for the decay factor or width $\sigma_i$ of each prototype. The method
is faster and also achieves higher classification accuracy than the RBF network
does.


Constructive methods with pruning

The normalized RBF network can be trained by using a constructive method 
with pruning strategy based on the novelty of the data and the overall behavior 
of the network [83]. The network starts from one neuron, and adds a new neuron 
if an example passes two novelty criteria, until a specified maximum number of 
neurons is reached. The first criterion is the same as (10.62), and the second one 
deals with the activation of the nonlinear neurons, $\max_i \phi_i(\mathbf{x}_t) < \zeta$, where $\zeta$ is a
threshold. A sequential learning algorithm is derived from the gradient-descent 
method. After the whole pattern set is presented at an epoch, the algorithm starts 
to remove those neurons that meet any of the three cases, namely, neurons with 
a very small mean activation for the whole pattern set, neurons with a very small 
activation region, or neurons having an activation very similar to that of other 
neurons. 

The dynamic decay adjustment [5] may result in too many neurons. Dynamic 
decay adjustment with temporary neurons [77] introduces online pruning of neu- 
rons after each dynamic decay adjustment training epoch. After each training 
epoch, if the individual neurons cover a sufficient number of samples, they are 
marked as permanent; otherwise, they are deleted. In dynamic decay adjust- 
ment with selective pruning and model selection [75], only a portion of the neu- 
rons that cover only one training example are pruned and pruning is carried 
out only after the last epoch of the dynamic decay adjustment training. The 
method improves the generalization performance of dynamic decay adjustment 
and dynamic decay adjustment with temporary neurons, but yields a larger net- 
work size than dynamic decay adjustment with temporary neurons. 

The resource-allocating network (RAN) [80] and RAN algorithms with pruning 
strategy are well-known RBF network construction methods, which are described 
in Section 10.9.2. 


Resource-allocating networks 


RAN [80] is a sequential learning method for the localized RBF network such 
as the Gaussian RBF network, which is suitable for online modeling of non- 
stationary processes. The network begins with no hidden units. As pattern pairs 
are received during training, a new hidden unit may be added according to the 
novelty in the data. The novelty in the data is defined by two conditions 


$$\|\mathbf{x}_t - \mathbf{c}_i\| > \varepsilon(t), \qquad (10.61)$$

$$\|\mathbf{e}(t)\| = \|\mathbf{y}_t - \mathbf{f}(\mathbf{x}_t)\| > e_{\min}, \qquad (10.62)$$

where $\mathbf{c}_i$ is the center nearest to $\mathbf{x}_t$, the prediction error $\mathbf{e} = (e_1, \ldots, e_{J_3})^T$, and
$\varepsilon(t)$ and $e_{\min}$ are thresholds to be selected appropriately. The algorithm starts
with $\varepsilon(t) = \varepsilon_{\max}$, where $\varepsilon_{\max}$ is chosen as the largest scale in the input space,
typically the entire input space of nonzero probability. The distance $\varepsilon(t)$ shrinks
exponentially as $\varepsilon(t) = \max\left\{\varepsilon_{\max} e^{-t/\tau}, \varepsilon_{\min}\right\}$, where $\tau$ is a decay constant. $\varepsilon(t)$
is decayed until it reaches $\varepsilon_{\min}$. Assuming that there are k nodes at time t − 1,
for the Gaussian RBF network, the newly added hidden unit at time t can be
initialized as

$$\mathbf{c}_{k+1} = \mathbf{x}_t, \qquad (10.63)$$

$$w_{(k+1)j} = e_j(t), \quad j = 1, \ldots, J_3, \qquad (10.64)$$

$$\sigma_{k+1} = \alpha \|\mathbf{x}_t - \mathbf{c}_i\|, \qquad (10.65)$$

where $\sigma_{k+1}$ is selected based on the nearest-neighbor heuristic and $\alpha$ is a
parameter defining the size of neighborhood. If pattern pair $(\mathbf{x}_t, \mathbf{y}_t)$ does not pass the
novelty criteria, no hidden unit is added and the existing network parameters are
adapted using the LMS method. In terms of network size and MSE, the RAN method
performs much better than RBF network learning using random centers or using
centers clustered by C-means [72]. It achieves roughly the same performance as
the MLP trained with BP, but with much less computation.
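
The allocation rule can be sketched as below. This is a simplified illustration, not Platt's original algorithm [80]: the class name `RAN`, the choice to adapt only the output weights in the LMS step, and the treatment of the very first example (always eligible for allocation, with its width taken from the current scale ε(t)) are assumptions.

```python
import numpy as np

class RAN:
    def __init__(self, n_in, n_out, eps_max, eps_min, tau, e_min, alpha, lr):
        self.C = np.zeros((0, n_in))      # centers, one row per hidden unit
        self.s = np.zeros(0)              # widths
        self.W = np.zeros((0, n_out))     # output weights
        self.p = (eps_max, eps_min, tau, e_min, alpha, lr)

    def _phi(self, x):
        d2 = np.sum((self.C - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.s ** 2))

    def predict(self, x):
        return self._phi(x) @ self.W      # zero vector while there are no units

    def observe(self, x, y, t):
        """Process pattern pair (x, y) at time t; return True if a unit was allocated."""
        eps_max, eps_min, tau, e_min, alpha, lr = self.p
        eps = max(eps_max * np.exp(-t / tau), eps_min)   # shrinking distance threshold
        e = y - self.predict(x)
        if len(self.C):
            dist = np.min(np.linalg.norm(self.C - x, axis=1))
        else:
            dist = eps                    # assumption: no center yet, use current scale
        far_enough = len(self.C) == 0 or dist > eps      # cf. (10.61)
        if far_enough and np.linalg.norm(e) > e_min:     # cf. (10.62)
            self.C = np.vstack([self.C, x])              # cf. (10.63)
            self.W = np.vstack([self.W, e])              # cf. (10.64)
            self.s = np.append(self.s, alpha * dist)     # cf. (10.65)
            return True
        self.W += lr * np.outer(self._phi(x), e)         # LMS step on the weights only
        return False
```

Since a new Gaussian unit responds with 1 at its own center and its weight is set to the current error, the network reproduces the target exactly at the example that triggered the allocation.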

EKF-based RAN [42] replaces the LMS method by the EKF method for the 
network parameter adaptation so as to generate a more parsimonious network. 
Two geometric criteria, namely the prediction error criterion, which is the same 
as (10.62), and the angle criterion, are also obtained from a geometric viewpoint. 
The angle criterion attempts to assign RBF's that are nearly orthogonal to all the 
other existing RBFs. These criteria are proved equivalent to Platt’s criteria [80]. 
In [43], the statistical novelty criterion is defined. By using the EKF method and 
using the statistical novelty criterion to replace the criteria (10.61) and (10.62), 
for a given task, more compact networks and smaller MSEs are achieved than 
RAN and EKF-based RAN. 


Resource allocating networks with pruning 
The RAN method can be improved by integrating a node-pruning procedure [112,
113, 85, 87, 99, 35, 36]. Minimal RAN [112, 113] is based on EKF-based RAN,
and achieves a more compact network with equivalent or better accuracy by
incorporating a pruning strategy to remove inactive nodes and augmenting the
basic growth criterion of RAN. The output of each RBF unit is scaled as

$$\bar{o}_i(\mathbf{x}) = \frac{|o_i(\mathbf{x})|}{\max_{1 \le j \le J_2} |o_j(\mathbf{x})|}, \quad i = 1, \ldots, J_2. \qquad (10.66)$$

If $\bar{o}_i(\mathbf{x})$ is below a predefined threshold $\delta$ for a given number of iterations, this
node is idle and can be removed. For a given accuracy, the minimal RAN achieves
a smaller complexity than the MLP trained with RProp, and it achieves a more
compact network and requires less training time than the MLP constructed by
dependence identification.
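
The idle-node test can be sketched with a per-unit counter of consecutive low-activity iterations. The function names, the use of a 1-D array of per-unit output magnitudes, and the `patience` parameter (the "given number of iterations") are assumptions for illustration.

```python
import numpy as np

def update_idle_counts(outputs, counts, delta):
    """outputs: per-unit outputs for the current input; counts: consecutive idle iterations."""
    peak = np.max(np.abs(outputs))
    if peak == 0.0:
        return counts                     # no unit responds; leave the counters unchanged
    scaled = np.abs(outputs) / peak       # scaled outputs, cf. (10.66)
    # Increment the counter of units below the threshold, reset the others
    return np.where(scaled < delta, counts + 1, 0)

def units_to_prune(counts, patience):
    """Indices of units that stayed below the threshold for `patience` iterations in a row."""
    return np.flatnonzero(counts >= patience)
```

A unit is removed only after it has stayed below the threshold for `patience` consecutive inputs, so a single weak response does not trigger pruning.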

In [85], RAN is improved by using the Givens QR decomposition-based RLS for
the adaptation of the weights and integrating a node-pruning strategy. The
error-reduction ratio criterion in [15] is used to select the most important regressors.
In [87], RAN is improved by using in each iteration the combination of the SVD 
and QR-cp methods for determining the structure as well as for pruning the 
network. SVD and QR-cp determine a subset of RBFs that is relevant to the 
linear output combination. In the early phase of learning, the addition of RBFs 
is in small groups, and this leads to an increased rate of convergence. 

The growing and pruning algorithm for RBF (GAP-RBF) [35] and the gener- 
alized GAP-RBF [36] are RAN-based sequential learning algorithms for realizing 
parsimonious RBF networks. These algorithms make use of the notion of signif- 
icance of a hidden neuron, which is defined as a neuron’s statistical contribution 
over all the inputs seen so far to the overall performance of the network. In 
addition to the two growing criteria of RAN, a new neuron is added only when 
its significance is also above a chosen learning accuracy. If during training the 
significance of a neuron becomes less than the learning accuracy, that neuron 
will be pruned. For each new pattern, only its nearest neuron is checked for 
growing, pruning, or updating using EKF. Generalized GAP-RBF enhances the 
significance criterion such that it is applicable for training examples with arbi- 
trary sampling density. GAP-RBF and generalized GAP-RBF outperform RAN, 
EKF-based RAN, and minimal RAN in terms of learning speed, network size 
and generalization performance. 

In [99], EKF-based RAN with statistical novelty criterion [43] is extended by 
incorporating an online pruning procedure, which is derived using the parameters 
and innovation statistics estimated from EKF. The online pruning method is 
analogous to saliency-based OBS and OBD. IncNet and IncNet Pro [40] are 
RAN-EKF networks with statistically controlled growth criterion. The pruning 
method is similar to OBS, but based on the result of the EKF algorithm. 


Pruning methods 


Well-known pruning methods are OBD and OBS, which are described in Chapter 4.
Pruning algorithms based on regularization are also popular; they incorporate
additional terms that penalize the complexity of the network into the MSE
criterion.

The pruning method proposed in [59] starts from a big RBF network, and 
achieves a compact network through an iterative procedure of training and selec- 
tion. The training procedure adaptively changes the centers and the width of the 
RBFs and trains the linear weights. The selection procedure performs the elim- 
ination of the redundant RBFs using an objective function based on the MDL 
principle. In [73], all the data vectors are initially selected as centers. Redundant 
centers are eliminated by merging two centers at each adaptation cycle by using 
an iterative clustering method. The technique is superior to the traditional RBF 
network algorithms, particularly in terms of processing speed and the solvability 
of nonlinear patterns. In [49], two methods are described for reducing the size of
the probabilistic neural network while preserving the classification performance
as well as possible.

In classical training methods for node-open fault, one needs to consider many 
potential faulty networks. In [60], the Kullback-Leibler divergence is used to 
define an objective function for improving the fault tolerance of RBF networks. 
Compared with some conventional approaches, including weight-decay based reg- 
ularizers, this approach has better fault-tolerant ability. In [94], an objective func- 
tion is presented for training a functional-link network to tolerate multiplicative 
weight noise. Under some mild conditions the derived regularizer is essentially 
the same as a weight decay regularizer. This explains why applying weight decay 
can also improve the fault-tolerant ability of an RBF with multiplicative weight 
noise. 


Complex RBF networks 


In a complex RBF network, the input and output of the network are complex 
values, whereas the activation function of the hidden nodes is the same as that 
for the RBF network. The Euclidean distance in the complex domain is defined 
by [17] 


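$$d(\mathbf{x}_t, \mathbf{c}_i) = \left[(\mathbf{x}_t - \mathbf{c}_i)^H (\mathbf{x}_t - \mathbf{c}_i)\right]^{1/2}, \qquad (10.67)$$

where $\mathbf{c}_i$ is a $J_1$-dimensional complex center vector. The output weights are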
complex valued. Most of the existing RBF network learning algorithms can be 
easily extended for training various versions of the complex RBF network [17, 
12, 56, 41]. When using clustering techniques to determine the RBF centers, the 
similarity measure can be based on the distance defined by (10.67). The Gaussian 
RBF is usually used in the complex RBF network. 

The Mahalanobis distance (10.51) defined for the Gaussian RBF can be 
extended to the complex domain [56] 

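$$d(\mathbf{x}_t, \mathbf{c}_i(t-1)) = \left\{[\mathbf{x}_t - \mathbf{c}_i(t-1)]^H \mathbf{\Sigma}_i^{-1}(t-1) [\mathbf{x}_t - \mathbf{c}_i(t-1)]\right\}^{1/2}, \quad i = 1, \ldots, J_2. \qquad (10.68)$$
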
Notice that transpose T in (10.51) is changed into Hermitian transpose H. 
Learning of the complex Gaussian RBF network can be performed in two 
phases, where the RBF centers are first selected by using incremental C-means 
and the weights are then solved by fixing the RBF parameters [56]. At each 
iteration t, C-means first finds the winning node c,, by using the nearest-neighbor 
paradigm, and then updates both the center and the variance of the winning node 
by 


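$$\mathbf{c}_w(t) = \mathbf{c}_w(t-1) + \eta \left[ \mathbf{x}_t - \mathbf{c}_w(t-1) \right], \qquad (10.69)$$

$$\boldsymbol{\Sigma}_w(t) = \boldsymbol{\Sigma}_w(t-1) + \eta \left[ \mathbf{x}_t - \mathbf{c}_w(t-1) \right] \left[ \mathbf{x}_t - \mathbf{c}_w(t-1) \right]^H, \qquad (10.70)$$

where $\eta$ is the learning rate. C-means is repeated until the changes in all $\mathbf{c}_i(t)$
and $\boldsymbol{\Sigma}_i(t)$ are within specified accuracy, that is,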


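$$\| \mathbf{c}_i(t) - \mathbf{c}_i(t-1) \| < \varepsilon_0, \qquad (10.71)$$

$$\| \boldsymbol{\Sigma}_i(t) - \boldsymbol{\Sigma}_i(t-1) \|_F < \varepsilon_1, \qquad (10.72)$$


where $\varepsilon_0$ and $\varepsilon_1$ are predefined small positive numbers. After complex RBF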
centers are determined, the weight matrix W is determined using the LS or the 
RLS algorithm. 
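
The two-phase procedure can be sketched as follows, under several assumptions: the winner is chosen with the complex Euclidean distance, the updates follow the incremental forms of (10.69) and (10.70), a fixed number of passes replaces the stopping tests, and the weights are found by ordinary least squares with a real-valued Gaussian response of fixed width. The data, initial centers, learning rate and width are hypothetical.

```python
import numpy as np

def incremental_cmeans(X, centers, eta=0.05, passes=20, seed=0):
    """Winner-take-all updates of complex centers and their outer-product matrices."""
    rng = np.random.default_rng(seed)
    centers = np.array(centers, dtype=complex)
    n_centers, dim = centers.shape
    sigmas = np.zeros((n_centers, dim, dim), dtype=complex)
    for _ in range(passes):
        for t in rng.permutation(len(X)):
            # Nearest-neighbor rule: the winner is the closest center.
            w = int(np.argmin(np.linalg.norm(centers - X[t], axis=1)))
            diff = X[t] - centers[w]            # uses the center before the update
            centers[w] += eta * diff
            sigmas[w] += eta * np.outer(diff, diff.conj())
    return centers, sigmas

def gaussian_design(X, centers, sigma):
    """Real-valued Gaussian responses, one column per center."""
    d2 = (np.abs(X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical data: two noisy clusters of one-dimensional complex samples.
rng = np.random.default_rng(1)
true_means = np.array([1 + 1j, -1 - 1j])
labels = rng.integers(0, 2, size=200)
noise = 0.05 * (rng.standard_normal(200) + 1j * rng.standard_normal(200))
X = (true_means[labels] + noise)[:, None]
D = true_means[labels][:, None]                 # complex-valued targets

# Phase 1: centers; phase 2: complex output weights with the centers fixed.
centers, sigmas = incremental_cmeans(X, [[0.5 + 0.5j], [-0.5 - 0.5j]])
Phi = gaussian_design(X, centers, sigma=1.0).astype(complex)
W, *_ = np.linalg.lstsq(Phi, D, rcond=None)
max_err = float(np.max(np.abs(Phi @ W - D)))
```

Because the hidden responses are real and only the output weights are complex, the second phase is an ordinary linear least-squares problem.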

In the complex-valued RBF network [18], each RBF node has a real-valued 
response that can be interpreted as a conditional pdf. Because the RBF node’s 
response is real-valued, this complex-valued RBF network is essentially two sep- 
arate real-valued RBF networks. 

A fully complex-valued RBF network [19] has a complex-valued response at 
each RBF node. For regression problems, the locally regularized OLS algo- 
rithm aided with the D-optimality experimental design is extended to the fully 
complex-valued RBF network. A complex-valued orthogonal forward selection 
algorithm based on the multi-class Fisher ratio of class separability measure is 
derived for constructing sparse complex-valued RBF classifiers that generalize 
well. 

The complex RBF network [12] adopts the stochastic gradient learning algo- 
rithm to adjust the parameters. A complex-valued minimal RAN equalizer is 
developed in [41]. Applying the growing and pruning criteria, it realizes a more 
compact structure and obtains better performance than complex RBF and 
many other equalizers do. Although the inputs and centers of complex RBF 
and complex-valued minimal RAN are complex-valued, the basis functions still 
remain real-valued. 

Complex-valued self-regulating RAN [95] is an incremental learning algorithm 
for a complex-valued RAN with a self-regulating scheme to select the appropriate 
number of hidden neurons. It uses a sech activation function in the hidden layer. 
The network is updated using a complex-valued EKF algorithm. 

The fully complex ELM [61] can use any elementary transcendental function (ETF) as the activation function. The fully
complex ELM based channel equalizer significantly outperforms other equalizers 
based on complex-valued minimal RAN, complex RBF network [12] and complex 
BP in terms of symbol error rate and learning speed. 


A comparison of RBF networks and MLPs


Both the MLP and the RBF network are used for supervised learning. In the RBF 
network, the activation of an RBF unit is determined by the distance between the 
input and prototype vectors. For classification problems, RBF units map input 
patterns from a nonlinearly separable space to a linearly separable space, and 
the responses of the RBF units form new feature vectors. Each RBF prototype
is a cluster serving mainly a certain class. When the MLP with a linear output
layer is applied to classification problems, minimizing the error at the output 
of the network is equivalent to maximizing the so-called network discriminant 
function at the output of the hidden units [104]. A comparison between the MLP 
and the localized RBF network is as follows. 


Global method vs. local method 

The use of the sigmoidal activation function makes the MLP a global method. For 
an input pattern, many hidden units will contribute to the network output. On 
the other hand, in the localized RBF network, each localized RBF covers a very 
small local zone. The local method satisfies the minimal disturbance principle 
[108], that is, the adaptation not only reduces the output error for the current 
example, but also minimizes disturbance to those already learned. 


Local minima 

Due to the sigmoidal function, the crosscoupling between hidden units of the 
MLP or recurrent networks results in high nonlinearity in the error surface, 
resulting in the problem of local minima or nearly flat regions. This problem 
gets worse as the network size increases. In contrast, the RBF network has a
simple architecture with linear output weights; once the centers and widths are
fixed, the least-squares problem for the weights has a unique solution.
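
A minimal sketch of this point, assuming Gaussian RBFs whose centers and common width have already been fixed: the output weights then follow from a single linear least-squares solve, with no iterative search. The target function, the random choice of centers and the width are hypothetical, and uniqueness additionally assumes that the design matrix has full column rank.

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Phi[n, i] = exp(-||x_n - c_i||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(np.pi * X[:, 0]) * X[:, 1]                      # hypothetical target

centers = X[rng.choice(len(X), size=20, replace=False)]    # fixed beforehand
Phi = rbf_design_matrix(X, centers, sigma=0.5)

# The error is quadratic in w, so one least-squares solve gives the minimizer.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residual = Phi @ w - y
mse = float(np.mean(residual ** 2))
grad_norm = float(np.linalg.norm(Phi.T @ residual))        # close to zero at the minimizer
```

The vanishing gradient `Phi.T @ residual` is the normal-equation condition; it holds at the global minimum of the quadratic error in the weights.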


Approximation and generalization 

Due to the global activation function, the MLP has greater generalization for 
each training example, and thus the MLP is a good candidate for extrapolation. 
On the contrary, the extension of a localized RBF to its neighborhood is deter- 
mined by its variance. This localized property prevents the RBF network from 
extrapolation beyond the training data. 


Network resources and curse of dimensionality 

The localized RBF network, like most kernel-type approximation methods, suffers
from the curse of dimensionality. It typically requires much more
data and more hidden units to achieve an accuracy similar to that of the MLP. 
In order to approximate a wide class of smooth functions, the number of hidden 
units required for the three-layer MLP is polynomial with respect to the input 
dimensions, while that for the localized RBF network is exponential [3]. The 
curse of dimensionality can be alleviated by using smaller networks with more 
adaptive parameters [81] or by progressive learning [25]. This requires a high 
number of training data and often leads to a poor ability to generalize. 


Hyperplanes vs. hyperellipsoids 

For the MLP, the response of a hidden unit is constant on a surface that consists
of parallel $(J_1 - 1)$-dimensional hyperplanes. As a result, the MLP is preferable
for linearly separable problems. On the other hand, in the RBF network, the
activation of the hidden units is constant on concentric $(J_1 - 1)$-dimensional
hyperspheres or hyperellipsoids. Thus, the RBF network may be more efficient
for linearly inseparable classification problems.
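
The geometric statement can be checked numerically with a small sketch. A sigmoidal unit is evaluated at two points on the same hyperplane, and a Gaussian unit at two points on the same hypersphere around its center. All weights, the center, the width and the test points are hypothetical values.

```python
import numpy as np

w, b = np.array([1.0, 2.0]), -0.5          # hypothetical MLP hidden-unit parameters
c, sigma = np.array([0.3, -0.2]), 0.7      # hypothetical RBF center and width

def sigmoid_unit(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def gaussian_unit(x):
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

x0 = np.array([0.4, 0.1])

# Move along a direction orthogonal to w: w^T x + b is unchanged (same hyperplane).
u = np.array([2.0, -1.0])
x_plane = x0 + 3.0 * u

# Rotate x0 about the center c: ||x - c|| is unchanged (same hypersphere).
theta = 1.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x_sphere = c + R @ (x0 - c)

same_sigmoid = bool(np.isclose(sigmoid_unit(x0), sigmoid_unit(x_plane)))
same_gaussian = bool(np.isclose(gaussian_unit(x0), gaussian_unit(x_sphere)))
# Swapping the two moves changes each response.
diff_sigmoid = not np.isclose(sigmoid_unit(x0), sigmoid_unit(x_sphere))
diff_gaussian = not np.isclose(gaussian_unit(x0), gaussian_unit(x_plane))
```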


Training speed and performing speed 

The error surface of the MLP has many local minima or large flat regions called 
plateaus, which lead to slow convergence of the training process. The MLP fre- 
quently gets trapped at local minima. For the localized RBF network, only a 
few hidden units have significant activations for a given input, thus the network 
modifies the weights only in the vicinity of the sample point and retains constant 
weights in the other regions. The RBF network requires orders of magnitude less 
training time for convergence than the MLP trained with BP to achieve compa- 
rable performance [9, 72]. For equivalent generalization performance, the MLP 
requires far fewer hidden units than the localized RBF network, thus the trained 
MLP is much faster in performing. 

Generally speaking, the MLP is a better choice if the training data is expensive. 
However, when the training data is cheap and plentiful or online training is 
required, the RBF network is desirable. 

Some properties of the MLP and the RBF network are combined for model- 
ing purposes. In the centroid-based MLP [58], a centroid layer is inserted into 
the MLP as the second layer. The conic-section function network [24] gener- 
alizes the activation function to include both the bounded (hypersphere) and 
unbounded (hyperplane) decision regions in one network. It can make automatic 
decisions with respect to the two decision regions. It combines the speed of the 
RBF network and the error minimization of the MLP. In [105], a hybrid RBF 
sigmoid neural network with a three-step training algorithm that utilizes both 
global search and gradient-descent training is proposed. The algorithm identifies 
global features of an input-output relationship before adding local detail to the 
approximating function. 


Example 10.2: Approximate the following function using the MLP and the RBF 
network: 


$$f(x_1, x_2) = 4 \left( 0.1 + x_1 e^{x_2} \left( 0.05 + x_1^4 - 10 x_1^2 x_2^2 + 5 x_2^4 \right) \right) \cos(2 \pi x_1),$$


$$x_1, x_2 \in [-0.5, 0.5].$$


The number of epochs is set to 1000, and the goal MSE is set to
$10^{-8}$. A grid of $21 \times 21$ samples is generated as the training set. For the MLP
we select 25 nodes, and LM as the training algorithm. For the RBF network
we select 70 nodes, $\sigma = 0.5$, and OLS as the training algorithm. The simulation
result is demonstrated in Fig. 10.7. The training and testing times for the MLP 
are 18.4208 s and 0.0511 s respectively, while those for the RBFN are 2.0622 s 
and 0.0178 s.

Figure 10.7 A comparison between the MLP and the RBF network. (a) The function to be
approximated. (b) The MSE obtained by using the MLP and the RBF network. (c) The 
approximation error by the MLP. (d) The approximation error by the RBF network. 


10.12 Bibliographical notes 


The Gaussian RBF network is a popular receptive-field network. Another well- 
known receptive-field network is the cerebellar model articulation controller 
(CMAC) [2, 71] associative memory network inspired by the neurophysiological 
properties of the cerebellum. CMAC is a distributed look-up-table system and 
is also suitable for VLSI realization. It can approximate slow-varying functions, 
and is orders of magnitude faster than BP. However, CMAC may fail in approximating
highly nonlinear or rapidly oscillating functions [10]. Pseudo-self-evolving
CMAC [96], inspired by the experience-driven synaptic plasticity observed in the
cerebellum, allocates its computing cells nonuniformly to overcome the
architectural deficiencies of the CMAC network. This mirrors the cerebellum,
where significantly higher densities of synaptic connections are located
in the frequently accessed regions.

The generalized single-layer network [32], also known as the generalized linear 
discriminant, has a three-layer architecture similar to the RBF network. Each 
node in the hidden layer has a nonlinear activation function $\phi_i(\cdot)$, and the output
nodes implement linear combinations of the nonlinear kernel functions of the
inputs. The RBF network is a type of generalized single-layer network. Like the 
OLS method, orthogonal methods in conjunction with some information criteria 
are usually used for self-structuring the generalized single-layer network to gen- 
erate a parsimonious, yet accurate, network [1]. The generalization ability of the 
generalized single-layer network is analyzed in [32] by using PAC learning the- 
ory and the concept of VC dimension. Necessary and sufficient conditions on the 
number of training examples are derived to guarantee a particular generalization 
performance of the generalized single-layer network [32]. 

The wavelet neural network [115, 116, 117] has the same structure as the RBF 
network, but uses wavelet functions as the activation function for the hidden 
units. Due to the localized properties in both the time and frequency domains of 
wavelet functions, wavelets are locally receptive field functions that approximate 
discontinuous or rapidly changing functions. The wavelet neural network has 
become a popular tool for function approximation. Wavelets with coarse resolu- 
tion can capture the global or low-frequency feature easily, while wavelets with 
fine resolution can capture the local or high-frequency feature of the function 
accurately. This distinguished characteristic leads the wavelet neural network to 
fast convergence, easy training and high accuracy. 


10.1 Plot the following two RBFs [81, 69] 


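$$\phi(r) = \frac{1}{(\sigma^2 + r^2)^{\alpha}}, \quad \alpha > 0,$$
$$\phi(r) = (\sigma^2 + r^2)^{\beta}, \quad 0 < \beta < 1,$$


where $r > 0$ denotes the distance from data point $\mathbf{x}$ to center $\mathbf{c}$, and $\sigma$ is used to control
the smoothness of the interpolating function. When $\beta = 1/2$, the RBF becomes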
Hardy’s multiquadric function, which is extensively used in surface interpolation 
with very good results [81]. 


10.2 Consider a 2-4-1 RBF network for XOR problem. Compute the linear 
weight $\mathbf{w}$ if the Gaussian RBF is used:


$$\phi_i(\mathbf{x}) = \exp\left( -\frac{\| \mathbf{x} - \mathbf{c}_i \|^2}{2 \sigma^2} \right).$$


10.3 Derive the gradient-descent algorithm for RBF parameters given by
(10.53)–(10.56).


10.4 Use the RBF network to approximate the following functions: 
(a) $h(x) = (1 - x + 2x^2) \exp(-x^2/2)$, $x \in [-10, 10]$.
(b) $f(x) = 2 \sin\left( \frac{x}{4} \right) \cos\left( \frac{x}{2} \right)$, $x \in [0, 10]$.


10.5 For the XOR problem, investigate the samples in the transformed space 
when using 

(a) the logistic function $\phi(r) = \frac{1}{1 + e^{r}}$.

(b) the thin-plate spline $\phi(r) = r^2 \ln(r)$.

(c) the multiquadric function $\phi(x) = (x^2 + 1)^{1/2}$.
(d) the inverse multiquadric function $\phi(x) = \frac{1}{(x^2 + 1)^{1/2}}$.

(e) Are the four points linearly separable in the transformed space? 


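10.6 For the Gaussian RBF network, derive the Jacobian matrix $\mathbf{J} = \left[ \frac{\partial E}{\partial w_i} \right]$
and the Hessian matrix $\mathbf{H} = \left[ \frac{\partial^2 E}{\partial w_i \partial w_j} \right]$, where $w_i$, $w_j$ denote any two parameters
such as network weights, RBF centers and width.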


Recurrent neural networks


11.1 Introduction


The brain is a strongly recurrent structure. This massive recurrence suggests a 
major role of self-feeding dynamics in the processes of perceiving, acting and 
learning, and in maintaining the organism alive. Recurrent networks harness the 
power of brain-like computing. There is at least one feedback connection in a
recurrent network. When recurrent networks are used, the network size is significantly
more compact than that of feedforward networks for the same approximation
accuracy of dynamic systems. The MLP is fundamentally limited in its ability to solve
topological relation problems. Recurrent networks can also be used as associative
memories to build attractors $\mathbf{y}_p$ from input-output associations $\{\mathbf{x}_p, \mathbf{y}_p\}$.

The MLP is purely static and is incapable of processing time information. One 
can add a time window over the data to act as a memory for the past. In the 
applications of dynamical systems, we need to forecast an input at time t+ 1 
from the network state at time t. The resulting network model for modeling a 
dynamical process is referred to as a temporal association network. Temporal 
association networks must have a recurrent architecture so as to handle the 
time-dependent nature of the association. 

To generate a dynamic neural network, memory must be introduced. The 
simplest memory element is the unit time delay, which has the transfer function 
$H(z) = z^{-1}$. The simplest memory architecture is the tapped delay line consisting
of a series of unit time delays. Tapped delay lines are the basis of traditional 
linear dynamical models such as finite impulse response (FIR) or infinite impulse 
response (IIR) models. An MLP may be made dynamic by introducing time delay 
loops to the input, hidden, and/or output layers. The memory elements can be 
either fully or sparsely interconnected. A network architecture incorporating time 
delays is the time-delay neural network [46]. 

Recurrent networks are dynamical systems with temporal state representa- 
tions. They are computationally powerful, and can be used in many temporal 
processing models and applications. Moreover, since the recurrent networks are 
modeled by systems of ordinary differential equations, they are also suitable for 
digital implementation using standard software for integration of ordinary differ- 
ential equations. The Hopfield model and the Cohen-Grossberg model are the two 
common recurrent network models. The Hopfield model can store information
in a dynamically stable structure. The Boltzmann machine is a generalization of
the Hopfield model. 

Recurrent networks can generally be classified into globally recurrent networks,
in which feedback connections between any pair of neurons are allowed, and
locally recurrent, globally feedforward networks [44] with the dynamics realized
inside neuron models. Both classes of models can be universal approximators for 
dynamical systems [18, 17]. In general, globally recurrent networks suffer from 
stability problems during training, and require complicated and time-consuming 
training algorithms. In contrast, locally recurrent networks are designed with 
dynamic neuron models which contain inner feedbacks, but interconnections 
between neurons are strict feedforward ones just as in the case of the MLP. 
They have a less complicated structure and yield simpler training. They allow 
for easy checking of the stability by examining poles of their internal filters. 
Explicit incorporation of past information into an architecture can be easily 
implemented. Analytical results show that a locally recurrent network with two 
hidden layers is able to approximate a state-space trajectory produced by any 
Lipschitz continuous function with arbitrary accuracy [27]. 

Two discrete-time formulations of recurrent networks are the time-delayed 
recurrent network and the simultaneous recurrent network. The time-delayed 
recurrent network is trained so as to minimize the error in prediction. By con- 
trast, the simultaneous recurrent network is not intended to provide better fore- 
casting over time or to provide memory of past or future, but rather uses recur- 
rence to provide general function approximation capability, based on concepts in 
Turing theory and complexity theory. The simultaneous recurrent network is a 
powerful function approximator [51]. It has been shown experimentally that an 
arbitrary function generated by an MLP can always be learned by a simultaneous 
recurrent network. However, the opposite is not true. 

The cellular structure-based simultaneous recurrent network has some inter- 
esting similarity to the hippocampus [51]. It is a function approximator that is 
more powerful than the MLP [13]; it can realize a desired mapping with much 
lower complexity than the MLP can. A generic cellular simultaneous recurrent 
network is implemented by training the network with EKF [13]. The cell is a 
generalized MLP. Each cell has the same weights, and this allows for arbitrarily 
large networks without increasing the number of weight parameters. 

Genetic regulatory networks can be described by nonlinear differential equations
with time delays. The delay-independent stability of two genetic regulatory
networks, namely a real-life repressilatory network with three genes and three
proteins, and a synthetic gene regulatory network with five genes and seven
proteins, is analyzed in [54].

Figure 11.1 Architecture of a fully connected recurrent network of J neurons.


11.2 Fully connected recurrent networks


For a recurrent network of J units, we denote the input of unit i by $x_i$ and the
output or state of unit i as $y_i$. The architecture of a fully connected recurrent
network is illustrated in Fig. 11.1.

The dynamics of unit i with sound biological and electronic motivation are
given by [12]

$$\mathrm{net}_i(t) = \sum_{j=1}^{J} w_{ji} y_j(t) + x_i(t), \quad i = 1, \ldots, J, \qquad (11.1)$$

$$\tau_i \frac{d y_i(t)}{d t} = -y_i(t) + \phi(\mathrm{net}_i) + x_i(t), \quad i = 1, \ldots, J, \qquad (11.2)$$

where $\tau_i$ is a time constant, $\mathrm{net}_i$ is the net input to unit i, $\phi(\cdot)$ is a sigmoidal
function, and input $x_i(t)$ and output $y_i(t)$ are continuous functions of time. In
(11.2), $-y_i(t)$ denotes natural signal decay.
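
To make the dynamics concrete, here is a minimal forward-Euler simulation. It assumes the state equation takes the form "time constant times the state derivative equals the negative state (decay), plus the sigmoid of the net input, plus the external input", with the logistic function as the sigmoid. The step size, the storage convention `W[j, i]` for the weight from unit j to unit i, and the test configuration are hypothetical choices.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def simulate_recurrent(W, tau, x_fn, y0, dt=0.01, steps=1000):
    """Forward-Euler integration; W[j, i] holds the weight from unit j to unit i."""
    y = np.array(y0, dtype=float)
    trajectory = [y.copy()]
    for k in range(steps):
        x = x_fn(k * dt)
        net = W.T @ y + x                       # net input of every unit
        dy = (-y + logistic(net) + x) / tau     # decay + sigmoidal drive + external input
        y = y + dt * dy
        trajectory.append(y.copy())
    return np.array(trajectory)

# Hypothetical check: with zero weights and zero input every unit relaxes
# toward logistic(0) = 0.5, the fixed point of dy/dt = -y + 0.5.
J = 3
traj = simulate_recurrent(W=np.zeros((J, J)), tau=np.ones(J),
                          x_fn=lambda t: np.zeros(J), y0=np.ones(J))
```

A smaller step size or a higher-order integrator would be preferable when the time constants are small relative to `dt`.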

Any continuous state-space trajectory can be approximated to any desired 
degree of accuracy by the output of a sufficiently large continuous-time recurrent 
network described by (11.1) and (11.2) [8]. In other words, the recurrent network 
is a universal approximator of dynamical systems. The universal approximation 
capability of recurrent networks has been investigated in [17, 19]. A fully con- 
nected discrete-time recurrent network with the sigmoidal activation function is 
a universal approximator of discrete- or continuous-time trajectories on compact 
time intervals [17]. A continuous-time recurrent network with the sigmoidal acti- 
vation function and external input can approximate any finite-time trajectory of 
a dynamical time-variant system [19]. 


Turing capabilities 

The neural Moore machine is the most general recurrent network architecture. It 
is the neural network version of the Moore machine, which is a type of finite-state 
machine. Elman’s simple recurrent net [6] is a widely used neural Moore machine. 
All general digital computers have some common features. The programs
executable on them form the class of recursive functions, and the model describing
them is called a Turing machine. In a Turing machine, a finite automaton is
used as a control or main computing unit, but this unit has access to potentially
infinite storage space.

In fact, while hidden Markov models (HMMs) and traditional discrete symbolic
grammar learning devices are limited to discrete state spaces, recurrent
networks are in principle suited to all sequence learning tasks due to their Turing
capabilities. Recurrent networks are Turing equivalent [38] and can therefore
compute whatever function any digital computer can compute. The simple recurrent
network is proved to have a computational power equivalent to that of any
finite-state machine [38].


Theorem 11.1 (Siegelmann and Sontag, 1995 [38]). All Turing machines
can be simulated by fully connected recurrent networks built on neurons with
sigmoidal activation functions.


Figure 11.2 Architecture of the time-delay neural network.


11.3 Time-delay neural networks


The time-delay neural network [46] maps a finite-time sequence $\{x(t), x(t-1), \ldots, x(t-m)\}$
into a single output y(t). It is a feedforward network equipped
with time-delayed versions of a signal x(t) as input. BP can be used to
train the network. The architecture of a time-delay neural network using a
three-layer MLP is illustrated in Fig. 11.2. The input to the network is a
vector comprising m + 1 consecutive samples. If the input to the network is
$\mathbf{x}_t = (x(t), x(t-1), \ldots, x(t-m))^T$ at time t, then it is
$\mathbf{x}_{t+i} = (x(t+i), x(t+i-1), \ldots, x(t+i-m))^T$ at time t + i. The model has been successfully applied
to speech recognition [46] and time-series prediction. The architecture can be 
generalized when the input and output are vectors. This network practically 
functions as an FIR filter. A time-delay neural network is not a recurrent network 
since there is no feedback and it preserves its dynamic properties by unfolding 
the input sequence over time. 
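
The unfolding of the input sequence can be sketched as a tapped-delay embedding: each row of the matrix below holds the current sample followed by its m delayed copies. The helper name and the test signal are hypothetical. The last lines illustrate the FIR remark for the special case of a single linear output unit, whose response equals a convolution with the tap weights.

```python
import numpy as np

def delay_embed(x, m):
    """Rows are (x(t), x(t-1), ..., x(t-m)) for t = m, ..., len(x) - 1."""
    x = np.asarray(x, dtype=float)
    return np.stack([x[m - k: len(x) - k] for k in range(m + 1)], axis=1)

x = np.arange(6.0)                  # hypothetical signal 0, 1, ..., 5
E = delay_embed(x, m=2)             # input vectors for a time-delay network

# With one linear output unit and tap weights h, the network is an FIR filter.
h = np.array([0.5, 0.3, 0.2])
y_linear = E @ h
y_fir = np.convolve(x, h, mode="valid")
```

A nonlinear time-delay network would feed each row of `E` into an MLP instead of the single linear unit used here.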

In Fig. 11.2, if the single output y(t + 1) is applied to a tapped-delay-line
memory of p units and the p delayed replicas of y(t + 1) are fed back to the input
of the network, the input to this new recurrent network is then
$(\mathbf{x}^T(t); \mathbf{y}^T(t))^T = (x(t), x(t-1), \ldots, x(t-m); y(t), y(t-1), \ldots, y(t-p))^T$ and the output of the
network is y(t + 1). Vector $\mathbf{x}(t)$ is an exogenous input originating from outside
the network, and $\mathbf{y}(t)$ is the regression of the model output y(t + 1). The model is
called a nonlinear autoregressive with exogenous inputs (NARX) model. It is a
delay recurrent network with feedback.
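
The feedback structure can be sketched as a closed-loop rollout in which each new output is appended to the output history and reused as a delayed input. The function `narx_rollout`, the stand-in map `f` (a simple linear rule used in place of a trained network) and the signals are all hypothetical.

```python
import numpy as np

def narx_rollout(f, x, y_init, m, p):
    """Iterate y(t+1) = f([x(t), ..., x(t-m); y(t), ..., y(t-p)]) in closed loop."""
    start = max(m, p)
    y = list(y_init)
    if len(y) != start + 1:
        raise ValueError("y_init must supply y(0), ..., y(max(m, p))")
    for t in range(start, len(x)):
        x_taps = [x[t - k] for k in range(m + 1)]    # exogenous input taps
        y_taps = [y[t - k] for k in range(p + 1)]    # fed-back output taps
        y.append(float(f(np.array(x_taps + y_taps))))
    return np.array(y)

# Hypothetical stand-in for the trained network: y(t+1) = 0.5 x(t) + 0.5 y(t).
m, p = 1, 1
f = lambda z: 0.5 * z[0] + 0.5 * z[m + 1]
y = narx_rollout(f, x=np.ones(5), y_init=[0.0, 0.0], m=m, p=p)
```

Only the model output is fed back, not any hidden state, which is the restricted form of recurrence that distinguishes the NARX model.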

As opposed to other recurrent networks, NARX networks have a limited feedback
which comes only from the output neuron rather than from the hidden
states. NARX networks with a finite number of parameters are constructively
proved to be computationally as strong as fully connected recurrent networks,
and thus at least as powerful as Turing machines [39].


Theorem 11.2 (Siegelmann, Horne and Giles, 1997 [39]). NARX net- 
works with one layer of hidden neurons with bounded, one-sided saturated acti- 
vation functions and a linear output neuron can simulate fully connected recur- 
rent networks with bounded, one-sided saturated activation functions, except for 
a linear slowdown. 


A linear slowdown means that if the fully connected recurrent network with 
N neurons computes a task of interest in time T, then the total time taken by 
the equivalent NARX network is (N + 1)T. By a minor modification, the logistic
function can be made a bounded, one-sided saturated function. From Theorems 
11.1 and 11.2, NARX networks with one hidden layer of neurons with bounded, 
one-sided saturated activation functions and a linear output neuron are Turing 
equivalent. 

By replacing each synapse weight with a linear, time-invariant filter, the MLP 
can be used for temporal processing [36]. When the filter is an FIR filter, we 
get an FIR neural network. The FIR MLP can be implemented as a resistance- 
capacitance model [36]. One can use a temporal extension of BP to train the 
FIR MLP [47, 48]. Once the network is trained, all the weights are fixed and
the network can be used as an MLP. The time-delay neural network is function- 
ally equivalent to the FIR network [48]. It can be easily related to a multilayer 
network by replacing the static synaptic weights with FIR filters [48]. 

The time-delay neural network can be an MLP-based [46] or an RBF
network-based [2] temporal neural network for nonlinear dynamics and time-series
learning. Both approaches use the same spatial representation of time.


Example 11.1: One of the most well-known applications of the MLP is the speech
synthesis system NETtalk [35]. NETtalk is a three-layer classification network,
as illustrated in Fig. 11.3, that translates English letters into phonemes. A string
of characters forming English text is then converted into a string of phonemes,
which can be further sent to an electronic speech synthesizer to produce speech.

Figure 11.3 Schematic drawing of the NETtalk architecture.


The MLP-based time-delay neural network is applied. The network has a total 
of 309 nodes and 18,629 weights. Both the BP and Boltzmann learning algo- 
rithms are applied to train the network. The hidden units play the same role 
as a rule extractor. The training set consists of a corpus from a dictionary, and 
phonetic transcriptions from informal, continuous speech of a child. NETtalk is 
suitable for fast implementation without any domain knowledge, while the devel- 
opment of conventional rule-based expert systems such as DECtalk needs years 
of group work. 


Example 11.2: The adaline network is a widely used neural network found in 
practical applications. Adaptive filtering is one of its major application areas. 
We use the adaline network with the tapped delay line. The input signal enters 
from the left and passes through N — 1 delays. The output of the tapped delay 
line is an N-dimensional vector, made up of the input signal at the current
and previous instants. The network is just an adaline neuron. In digital signal
processing, this neuron is referred to as an FIR filter. 

We use an adaptive filter to predict the next value of a stationary random
process, p(t). With p(t) as the target, we train a time-delay adaline network
whose input is the previous five data samples
$(p(t-1), p(t-2), \ldots, p(t-5))^T$ and whose output approximates the target p(t).
The learning rate is set as 0.05. The approximation result for 10 adaptation
passes is shown in Fig. 11.4a.
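
A minimal sketch of such a time-delay adaline predictor is given below, assuming the Widrow-Hoff (LMS) rule as the training rule. The five taps, the learning rate of 0.05 and the 10 passes follow the example; the noisy sinusoid is a hypothetical stand-in, since the process used for the figure is not specified here.

```python
import numpy as np

def train_adaline_predictor(p, n_taps=5, lr=0.05, passes=10):
    """LMS training of a linear neuron fed by a tapped delay line."""
    w, b = np.zeros(n_taps), 0.0
    for _ in range(passes):
        for t in range(n_taps, len(p)):
            u = p[t - n_taps:t][::-1]        # (p(t-1), ..., p(t-n_taps))
            e = p[t] - (w @ u + b)           # prediction error
            w += lr * e * u                  # Widrow-Hoff update
            b += lr * e
    return w, b

def predict(p, w, b):
    n_taps = len(w)
    return np.array([w @ p[t - n_taps:t][::-1] + b
                     for t in range(n_taps, len(p))])

# Hypothetical signal: a sinusoid with a small amount of additive noise.
rng = np.random.default_rng(0)
t = np.arange(200)
p = np.sin(0.3 * t) + 0.05 * rng.standard_normal(200)

w, b = train_adaline_predictor(p)
mse = float(np.mean((predict(p, w, b) - p[5:]) ** 2))
```

Since the neuron is linear, the trained weights `w` are simply the coefficients of a five-tap FIR prediction filter.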

We solve the same problem using an MLP-based time-delay neural network with
1 hidden neuron. BP with a learning rate of 0.05 is used for training, and training
is performed for 10 adaptation passes. The result is shown in Fig. 11.4b. The
performance of the MLP-based method is poor, but can be improved by using
more hidden nodes, a better training method, and more epochs.

Figure 11.4 Prediction using a time-delay network. (a) Result for 10 adaptation passes using a time-delay
adaline network. (b) Result for 10 adaptation passes using an MLP-based time-delay neural network.


11.4 Backpropagation for temporal learning 


When a recurrent feature is integrated into the MLP architecture, the new model
is capable of learning dynamic systems. The BP algorithm must be
modified accordingly.

Temporal behavior can be modelled by a linear, time-invariant filter. The 
temporal behavior of synapse i of neuron j may be described by an impulse 
response $h_{ji}(t)$ that is a function of continuous time t, which corresponds to the
weight $w_{ji}$ [36]. These filters can be implemented by using RC circuits. When
the hidden and output neurons of an MLP all use the FIR model, such a neural 
network is referred to as an FIR MLP. 

An FIR MLP can be unfolded in time. This removes all the time delays in the 
network by expanding it into an equivalent but larger network so that the standard
BP algorithm can be applied to compute the instantaneous error gradients. One
can implement forward or backward unfolding in time. The forward choice results 
in a network size that is linear with the total number of time delays and the 
total number of free parameters in the network, while the backward case yields 
a network size that grows geometrically with the number of time delays and 
layers. The forward choice is thus preferred. Temporal backpropagation learning
[47] overcomes the drawbacks of the unfolded method. 

Time-lagged recurrent networks subsume many conventional signal processing 
structures, e.g., tapped delay lines, FIR and IIR filters. At the same time, they
can represent dynamic systems with strongly hidden states and possess universal
approximation capability for dynamic systems. Recurrent MLPs are a special 
class of time-lagged recurrent networks; they have a layered connectivity pattern. 
The recurrent MLP architecture, in which the output of the MLP is fed back to 
the input of the network, is a universal approximator for any dynamical system 
[18]. A BP-like learning algorithm with robust modification is used to train the 
recurrent MLP. 

Recurrent BP is used to train fully connected recurrent networks, where the 
units are assumed to have continuous states [31, 1]. After training on a set of
N patterns $\{(\mathbf{x}_p, \mathbf{y}_p)\}$, the presentation of $\mathbf{x}_p$ will drive the network output to
a fixed attractor state $\mathbf{y}_p$. Thus, the algorithm learns static mappings, and the
trained network may be used as an associative memory. The computational complexity is $O(J^2)$ for
a recurrent network of J units. When the initial weights are selected as small
values, the network almost always converges to a stable point. 

The time-dependent recurrent learning (TDRL) algorithm is an extension of 
recurrent BP to dynamic sequences that produce time-dependent trajectories 
[28]. It is a gradient-descent method that searches for the weights of a continuous 
recurrent network to minimize the error function of the temporal trajectory of 
the states. 


Real-time recurrent learning

The real-time recurrent learning (RTRL) algorithm [52] is used for training fully
connected recurrent networks with discrete-time states. It is a modified BP algo- 
rithm, and is an online algorithm without the need for allocating memory pro- 
portional to the maximum sequence length. RTRL decomposes the error function 
in time and evaluates instantaneous error derivatives with respect to the weights 
at each time step. The performance criterion for RTRL is the minimization of 
the total error over the entire temporal interval. In RTRL, calculation of the 
derivatives of node outputs with respect to the network weights must be carried 
out during the forward propagation of signals in a network. RTRL is suitable for 
tasks that require retention of information over fixed or indefinite time length,
and it is best suited for real-time applications. The normalized RTRL algorithm
[26] normalizes the learning rate of RTRL at each step so that one has the
optimal adaptive learning rate for every discrete time instant. The algorithm has 
a posteriori learning in the recurrent networks. Normalized RTRL is faster and 
more stable than RTRL. In [15], RTRL is extended to its complex-valued form 
where the inputs, outputs, weights and activation functions are complex-valued. 
For a time-lagged recurrent network, the computational complexity of RTRL
scales by $O(J^4)$ for $J$ nodes (in the worst case), with the storage of all variables
scaling by $O(J^3)$ [53]. Furthermore, RTRL requires that the dynamic derivatives
be computed at every time step for which the time-lagged recurrent network 
is executed. Such coupling of forward propagation and derivative calculation is 
due to the fact that in RTRL both the derivatives and the time-lagged recurrent 
network node outputs evolve recursively. 
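
The recursive evolution of the derivatives can be made concrete with a small sketch. The code below is a minimal illustration under stated assumptions, not the algorithm of [52] verbatim: the network is taken to be $x(t+1) = \tanh(W [x(t); u(t); 1])$, every unit is treated as an output unit, and the weights are held fixed over the sequence so that the accumulated gradient can be checked against the total error. The names `rtrl_gradient` and `total_error` are hypothetical.

```python
import numpy as np

def rtrl_gradient(W, inputs, targets):
    """Gradient of the total squared error, accumulated forward in time (RTRL).

    Assumed model: x(t+1) = tanh(W @ [x(t), u(t), 1]); W has shape (J, J + K + 1).
    """
    J = W.shape[0]
    x = np.zeros(J)
    P = np.zeros((J, J, W.shape[1]))        # P[k, i, j] = d x_k / d w_ij
    grad = np.zeros_like(W)
    diag = np.arange(J)
    for u, d in zip(inputs, targets):
        z = np.concatenate([x, u, [1.0]])   # previous state, input, bias
        x = np.tanh(W @ z)
        # Sensitivity recursion: O(J^4) operations per step, O(J^3) storage.
        P_new = np.einsum("kl,lij->kij", W[:, :J], P)
        P_new[diag, diag, :] += z           # direct dependence of unit i on w_ij
        P = (1.0 - x**2)[:, None, None] * P_new
        grad += np.einsum("k,kij->ij", x - d, P)   # instantaneous error derivative
    return grad

def total_error(W, inputs, targets):
    """Sum over time of 0.5 * ||x(t) - d(t)||^2 for the same assumed model."""
    x = np.zeros(W.shape[0])
    err = 0.0
    for u, d in zip(inputs, targets):
        x = np.tanh(W @ np.concatenate([x, u, [1.0]]))
        err += 0.5 * np.sum((x - d) ** 2)
    return err

rng = np.random.default_rng(0)
J, K, T = 3, 2, 6
W = 0.5 * rng.normal(size=(J, J + K + 1))
inputs = rng.normal(size=(T, K))
targets = rng.uniform(-0.5, 0.5, size=(T, J))
grad = rtrl_gradient(W, inputs, targets)
```

In an online setting the weights would be updated from the instantaneous term at every step rather than accumulated; the fixed-weight version is used here only because it can be verified by finite differences.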

The delayed recurrent network is a continuous-time recurrent network having
time-delayed feedbacks. Due to the presence of time delay in the differential
equation, the system has an infinite number of degrees of freedom. The TDRL 
and RTRL algorithms are introduced into the delayed recurrent network [43]. 
A comparative study of the recurrent network and the time-delay neural net- 
work has been made in terms of the learning algorithms, learning capability, and 
robustness against noise in [43]. 

Complex-valued RTRL algorithms using split complex activation functions 
[15], [5] do not follow the generic form of their real RTRL counterparts. A 
complex-valued RTRL algorithm using a general fully complex activation func- 
tion [9] represents a natural extension of the real-valued RTRL. An augmented 
complex-valued EKF algorithm [10] is achieved based on augmented complex 
statistics and the use of fully complex nonlinear activation functions. 


BP through time 

BP through time (BPTT) is the most popular method for performing supervised 
learning of recurrent networks [34, 50]. It is an adapted version of BP for recur- 
rent networks. BPTT is a method for unfolding a recurrent network in time to 
make an equivalent feedforward network each time a sequence is processed so 
that the derivatives can be computed via standard BP. The main limitation of 
BPTT is the static interval unfolded in time, which is unable to accommodate
the processing of newly arrived information. Normally, BPTT truncates the
continuous input-output sequence by a length n, which defines the number of 
time intervals to unfold and is the size of buffer memory to train the unfolded 
layers. It means that BPTT cannot take care of the sequences before n time 
steps and that the main memory is in external buffer memory except for the 
weights. 

The use of normal recurrent network and BPTT in incremental learning 
destroys the memory of past sequences. For a long sequence, the unfolded network 
may be very large, and BPTT will be inefficient. Truncated BPTT alleviates this 
problem by ignoring all the past contributions to the gradient beyond a certain
time into the past [53]. For BPTT(h), the computational complexity scales by
$O(hJ^2)$ and the required memory is $O(hJ)$, for truncation length $h$. BPTT(h)
leads to more stable computation of dynamic derivatives than forward methods 
do because it utilizes only the most recent information in a trajectory. The use 
of BPTT(h) permits training to be carried out asynchronously with execution 
of the time-lagged recurrent network. 
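
The truncation can be illustrated with a short sketch. It assumes the same simple model $x(t+1) = \tanh(W [x(t); u(t); 1])$ and an error measured only at the final time step; the helper names `forward`, `final_error` and `bptt_h` are hypothetical. For simplicity the whole sequence is buffered here, whereas a streaming implementation would keep only the last $h$ layers.

```python
import numpy as np

def forward(W, inputs):
    """Run x(t+1) = tanh(W @ [x(t), u(t), 1]) from x(0) = 0 and buffer each layer."""
    x = np.zeros(W.shape[0])
    zs, xs = [], []
    for u in inputs:
        z = np.concatenate([x, u, [1.0]])
        x = np.tanh(W @ z)
        zs.append(z)
        xs.append(x)
    return zs, xs

def final_error(W, inputs, target):
    _, xs = forward(W, inputs)
    return 0.5 * np.sum((xs[-1] - target) ** 2)

def bptt_h(W, inputs, target, h):
    """Gradient of the final-step error, backpropagated through h unfolded layers."""
    J = W.shape[0]
    zs, xs = forward(W, inputs)
    grad = np.zeros_like(W)
    delta = (xs[-1] - target) * (1.0 - xs[-1] ** 2)   # error at the last pre-activation
    for step in range(min(h, len(inputs))):
        t = len(inputs) - 1 - step
        grad += np.outer(delta, zs[t])                # O(J^2) work per unfolded layer
        if t > 0:
            # Propagate the error one layer back through the recurrent weights.
            delta = (W[:, :J].T @ delta) * (1.0 - xs[t - 1] ** 2)
    return grad

rng = np.random.default_rng(0)
J, K, T = 3, 2, 8
W = 0.5 * rng.normal(size=(J, J + K + 1))
inputs = rng.normal(size=(T, K))
target = rng.uniform(-0.5, 0.5, size=J)

grad_full = bptt_h(W, inputs, target, h=T)    # h = T recovers the exact gradient
grad_trunc = bptt_h(W, inputs, target, h=3)   # contributions older than 3 steps ignored
```

Choosing $h$ equal to the sequence length reproduces full BPTT, so the truncated gradient can be compared directly against the exact one to judge how much is lost.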

Simultaneous recurrent networks trained by truncated BPTT with EKF are 
used for training weights in WTA networks with a smooth, nonlinear activation 
function [4]. BPTT is used for obtaining temporal derivatives, whereas EKF is 
the weight update method utilizing these derivatives. 

For an overview of various gradient-based learning algorithms for recurrent 
networks, see [29]. A comprehensive analysis and comparison of BPTT, recurrent 
BP and RTRL is given in [53]. 


RBF networks for modeling dynamic systems


The sequential RBF network learning algorithms, such as the RAN family, are 
capable of modifying both the network structure and the output weights online; 
thus, these algorithms are particularly suitable for modeling dynamical time- 
varying systems, where not only the dynamics but also the operating region 
changes with time. 

In order to model complex nonlinear dynamical systems, the state-dependent 
autoregressive (AR) model with functional coefficients is often used. The RBF 
network can be used as a nonlinear AR time-series model for forecasting applica- 
tion [37]. It can also be used to approximate the coefficients of a state-dependent 
AR model, thus yielding the RBF-AR model [45]. The RBF-ARX model is an 
RBF-AR model with an exogenous variable [30]. The RBF-ARX model usually 
uses far fewer RBF centers when compared to the RBF network. 

For time-series applications, the input to the network is
$x(t) = (y(t-1), \ldots, y(t-n_y))^T$, and the network output is $y(t)$. The dual-orthogonal
RBF network algorithm is specially designed for nonlinear time-series prediction
[3].
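
As a small illustration of this input construction, the following sketch builds the lagged input vectors and the matching targets from a scalar series. The function name `lagged_inputs` is hypothetical, and the toy series is chosen only so that the indexing is easy to follow.

```python
import numpy as np

def lagged_inputs(y, n_y):
    """Return (X, targets): each row of X is (y(t-1), ..., y(t-n_y)), the target is y(t)."""
    y = np.asarray(y, dtype=float)
    # Column k-1 holds the series delayed by k steps, aligned with targets y[n_y:].
    X = np.column_stack([y[n_y - k : len(y) - k] for k in range(1, n_y + 1)])
    targets = y[n_y:]
    return X, targets

y = np.arange(6.0)                 # toy series 0, 1, 2, 3, 4, 5
X, targets = lagged_inputs(y, n_y=3)
```

Any regression model, an RBF network included, can then be fitted on the pairs `(X, targets)`.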

For online adaptation of nonlinear systems, a constant exponential forgetting 
factor is commonly applied to all the past data uniformly. This is incorrect for 
nonlinear systems whose dynamics are different in different operating regions. 
In [55], online adaptation of the Gaussian RBF network is implemented using 
a localized forgetting method, which sets different forgetting factors in different 
regions according to the response of the local prototypes to the current input 
vector. The method is applied in conjunction with recursive OLS and the com- 
putation is very efficient. 

The spatial representation of time in the time-delay neural network model is 
inconvenient and also the use of temporal window imposes a limit on the sequence 
length. Recurrent RBF networks, which combine features from the recurrent net- 
work and the RBF network, are suitable for the modeling of nonlinear dynamic 
systems [7]. The recurrent RBF network introduced in [7] has a four-layer archi- 
tecture, with an input layer, an RBF layer, a state layer and a single-neuron 
output layer. The state and output layers use the sigmoidal activation function. 

Real-time approximators for continuous-time dynamical systems with many 
inputs are presented in [20]. These approximators employ a self-organizing RBF 
network, whose structure varies dynamically to keep a specified approximation 
accuracy by adding or pruning online. The performance of this variable structure 
RBF network approximator with both the Gaussian RBF and the raised-cosine 
RBF is analyzed. The compact support of the raised-cosine RBF enables faster 
training and easier output evaluation of the network, compared to the case with 
the Gaussian RBF. 


Some recurrent models


In [16], the complex-valued recurrent network is used as an associative mem- 
ory of temporal sequences. It has a much superior ability to deal with temporal 
sequences than the real-valued counterpart does. One of the examples is the
memorization of melodies. The network can memorize multiple melodies and recall them
correctly from any part. 

A special kind of recurrent network for online matrix inversion is designed 
based on a matrix-valued error function instead of a scalar-valued error function 
in [56]. Its discrete-time model is investigated in [57]. When the linear activation 
function and a unit step size are used, the discrete-time model reduces exactly 
to Newton iteration for matrix inversion. 
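
For reference, the Newton iteration for matrix inversion is $X_{k+1} = X_k (2I - A X_k)$. The sketch below shows only this iteration, not the recurrent network of [56, 57]; the initial guess $X_0 = A^T / (\|A\|_1 \|A\|_\infty)$ is a standard choice assumed here, and `newton_inverse` is a hypothetical name.

```python
import numpy as np

def newton_inverse(A, steps=30):
    """Approximate the inverse of a nonsingular matrix A by Newton iteration."""
    # Assumed starting point: scaled transpose, which keeps ||I - A X0||_2 < 1.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(steps):
        X = X @ (2 * I - A @ X)    # the residual I - A X is squared at every step
    return X

A = np.array([[4.0, 1.0], [2.0, 3.0]])
X = newton_inverse(A)
```

Because the residual is squared at each step, convergence is quadratic once the iteration is under way.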

The backpropagation-decorrelation rule [41] combines three principles: one- 
step backpropagation of errors, the use of the temporal memory in the network 
dynamics, and the utilization of a reservoir of inner neurons. The algorithm 
adapts only the output weights of a possibly large network and therefore can learn 
with a complexity of O(N). A stability analysis of the algorithm is provided based 
on nonlinear feedback theory in [42]. Backpropagation-decorrelation learning is 
further enhanced with an efficient online rescaling algorithm to stabilize the 
network while adapting. 

The stability of dynamic BP training is studied by the Lyapunov method in 
[21]. A robust adaptive gradient-descent algorithm [40] for the recurrent network 
is similar, in some ways, to the RTRL algorithm in terms of using a specifically 
designed derivative based on the extended recurrent gradient to approximate the 
true gradient for real-time learning. It switches the training patterns between 
standard online BP and RTRL according to the derived convergence and sta- 
bility conditions so as to optimize the convergence speed of robust adaptive 
gradient descent and to make an optimal tradeoff between the online BP and 
RTRL training strategies to maximize the learning speed. The optimized adap- 
tive learning maximizes the training speed of the recurrent network for each 
weight update without violating the stability and convergence criteria. Robust 
adaptive gradient descent provides improved training speed over RTRL with fewer
discrete time steps of transit and smaller steady-state error. The method uses
three adaptive parameters to adjust the effective adaptive learning rate and to 
provide guaranteed weight convergence and system stability for training. 

The simplex and interior-point algorithms are two effective methods for the LP 
problem. A one-layer recurrent network with a discontinuous activation function 
is proposed for LP in [22]. The number of neurons is equal to that of the decision 
variables. The neural network with a sufficiently high gain is proven to be globally 
convergent to the optimal solution. 


Elman networks

A simultaneous recurrent network in its simplest form is a three-layer feedforward 
network with the self-recurrent hidden layer [6]. The Elman recurrent network 
topology consists of feedback connections from every hidden neuron output to 
every hidden neuron input via a context layer. It is a special case of a general 
recurrent network and could thus be trained with full BPTT and its variants. 
Instead of regarding the hidden layer as self-recurrent, the activities of the hid- 
den neurons are stored into a context layer in each time step and the context 
layer acts as an additional input to the hidden layer in the next time step. 
Elman BP is a variant of BPTT(n), with n = 1. An Elman recurrent network 
can simulate any given deterministic finite-state automaton. The attention-gated 
reinforcement learning scheme [33] effects an amalgamation of BP and reinforce- 
ment learning for feedforward networks in classification tasks. These ideas are 
recast to simultaneous recurrent networks in prediction tasks, resulting in the 
reimplementation of Elman BP as a reinforcement scheme [11]. 
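
The role of the context layer can be sketched in a few lines. This is a forward pass only, with assumptions that are not taken from the text: tanh hidden units, a linear output layer, small random weights, and a zero initial context. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 1, 10, 1
W_in = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden
W_ctx = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context -> hidden
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def elman_forward(inputs):
    """Run the network over a sequence; the context stores the previous hidden state."""
    context = np.zeros(n_hid)
    outputs, hiddens = [], []
    for u in inputs:
        hidden = np.tanh(W_in @ u + W_ctx @ context)
        outputs.append(W_out @ hidden)
        hiddens.append(hidden)
        context = hidden.copy()      # copied into the context layer for the next step
    return np.array(outputs), np.array(hiddens)

inputs = np.sin(np.linspace(0, 4 * np.pi, 40)).reshape(-1, 1)
outputs, hiddens = elman_forward(inputs)
```

Treating the context as an ordinary extra input at each step is what makes a one-step gradient, as in Elman BP, straightforward to compute.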

Elman networks are not as reliable as some other kinds of networks, because 
both training and adaptation happen using an approximation of the error gra- 
dient due to the delays in Elman networks. For Elman networks, we do not rec- 
ommend algorithms that take large step sizes, such as trainlm and trainrp. An 
Elman network needs more hidden neurons in its hidden layer than are actually 
required for a solution by another method. While a solution might be available
with fewer neurons, the Elman network is less able to find the most appropriate 
weights for hidden neurons due to the approximated error gradient. 

A general recursive Bayesian LM algorithm is derived to sequentially update 
the weights and the covariance (Hessian) matrix of the Elman network for 
improved time-series modeling in [25]. The approach employs a principled han- 
dling of the regularization hyperparameters. The recursive Bayesian LM algo- 
rithm outperforms standard RTRL and EKF algorithms for training recurrent 
networks on time-series modeling. 


Example 11.3: A problem where temporal patterns are recognized and classified 
with a spatial pattern is amplitude detection. Amplitude detection requires that 
a waveform be presented to a network through time, and that the network out- 
put the amplitude of the waveform. It demonstrates the Elman network design 
process. 

We create an Elman network with one hidden layer of 10 nodes, and the 
maximum number of epochs is 1000. The training algorithm is gradient descent. 
The performance for training and generalization is shown in Fig. 11.5. It is shown
that the trained Elman network has good generalization performance for an input 
signal of varying amplitude. 

Figure 11.5 Training the Elman network for amplitude detection. (a) The training MSE. (b) The
network output after training. (c) The generalization of the trained network for different input
amplitudes.


11.7 Reservoir computing


Reservoir computing is a technique for efficient training of recurrent networks. 
It refers to a class of state-space models with a fixed state transition structure 
(the reservoir) and an adaptable readout from the state space. The reservoir is
supposed to be sufficiently complex so as to capture a large number of features 
of the input stream that can be exploited by the reservoir-to-output readout 
mapping. The idea of using a randomly connected recurrent network for online 
computation on an input sequence was introduced in the echo state network [14] 
and liquid state machine [24]. Echo state networks use analogue sigmoid neurons, 
and liquid state machines use spiking neurons. 

Reservoir computing models are dynamical models for processing time series 
that make a conceptual separation of the temporal data processing into two 
parts: representation of temporal structure in the input stream through a non- 
adaptable dynamic reservoir, and a memoryless easy-to-adapt readout from the 
reservoir. Reservoir computing subsumes the idea of using general dynamical 
systems, the so-called reservoirs, in conjunction with trained memoryless read- 
out functions as computational devices. For a review of reservoir computing refer 
to [23]. Liquid state machines and echo state networks are investigated in the
reservoir computing field, assuming that a large number of randomly connected
neurons form a reservoir providing memory for different aspects of the signals.
Readout neurons extract from this reservoir stable information in real time.

Figure 11.6 Echo state network architecture.

For efficient training of large recurrent networks, instead of training all the 
weights, the weights are initialized randomly and the desired function is imple- 
mented by a full instantaneous linear mapping of the neuron states. An echo 
state network is a recurrent discrete-time network with a nontrainable sparse 
recurrent part (reservoir) and a simple linear readout. It uses analog sigmoidal 
neurons as network units. It has K input units, N internal (reservoir) units and L 
output units, as shown in Fig. 11.6, where $u = (u_1, u_2, \ldots, u_K)^T$ is the input vector,
$x = (x_1, x_2, \ldots, x_N)^T$ is the internal state vector, and $y = (y_1, y_2, \ldots, y_L)^T$
is the output vector. The performance is largely independent of the sparsity of 
the network or the exact network topology. Connection weights in the reservoir 
and the input weights are randomly generated. The reservoir weights are scaled 
so as to ensure the echo state property: the reservoir state is an echo of the 
entire input history. The echo state property is a condition of asymptotic state 
convergence of the reservoir network, influenced by driving input. A memoryless 
readout device is then trained in order to approximate from this echo a given 
time-invariant target operator with fading memory, whereas the network itself 
remains untrained. 
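
A minimal echo state network can be sketched as follows. The specific choices are assumptions made for illustration: tanh units, about 10% reservoir connectivity, rescaling of the reservoir matrix to spectral radius 0.9 (a common heuristic for obtaining the echo state property, not a guarantee), a ridge-regression readout, and a toy one-step-ahead sine prediction task.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 1, 100, 1                                  # input, reservoir, output units

W_in = rng.uniform(-0.5, 0.5, size=(N, K))           # random, untrained input weights
W = rng.uniform(-0.5, 0.5, size=(N, N))
W[rng.random((N, N)) > 0.1] = 0.0                    # keep roughly 10% of the connections
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))      # rescale to spectral radius 0.9

def run_reservoir(u_seq):
    """Drive the fixed reservoir with the input sequence and collect its states."""
    x = np.zeros(N)
    states = []
    for u in u_seq:
        x = np.tanh(W_in @ u + W @ x)
        states.append(x)
    return np.array(states)

# Toy task: predict the next sample of a sine wave.
signal = np.sin(0.2 * np.arange(1200))
u_seq, target = signal[:-1, None], signal[1:, None]

states = run_reservoir(u_seq)
washout, n_train = 100, 800                          # discard the initial transient
S_tr, Y_tr = states[washout:n_train], target[washout:n_train]

# Only the linear readout is trained, here by ridge regression.
ridge = 1e-4
W_out = np.linalg.solve(S_tr.T @ S_tr + ridge * np.eye(N), S_tr.T @ Y_tr).T

pred = states[n_train:] @ W_out.T
test_mse = np.mean((pred - target[n_train:]) ** 2)
```

Training reduces to one linear solve because the reservoir and input weights are never adapted.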

The liquid state machine was introduced as a spiking neural network model,
inspired by the structural and functional organization of the mammalian neocor- 
tex [24]. It uses spiking neurons connected by dynamic synapses to project inputs 
into a high-dimensional feature space, allowing classification of inputs by linear 
separation, similar to the SVM approach. The key idea is to use a large but fixed 
recurrent part as a reservoir of dynamic features and to train only the output 
layer to extract the desired information. The training thus consists of a linear 
regression problem. Liquid state machines exploit the power of recurrent spik- 
ing neural networks without training the spiking neural network. They can yield 
competitive results; however, the process can require numerous time-consuming 
epochs. Introducing small-world statistics clearly leads to an increase in perfor- 
mance. Small-world architectures are common in biological neuronal networks 
and many real world networks. A small-world network [49] is a type of graph in
which nodes are highly clustered compared to a random graph, and a short path
length exists between any two nodes in the network.

The reservoir of a quantized echo state network is defined as a network of
discrete-valued units, where the number of admissible states of a single unit
is controlled by a parameter called state resolution, which is measured in bits.
Liquid state machines and echo state networks can thus be interpreted as the two
limiting cases of quantized echo state networks for low and high state resolution,
respectively.

Reservoir construction is largely driven by a series of randomized model-building
stages, which rely on a series of trials and errors. Typical model construction
decisions for an echo state network involve setting the reservoir size, the
sparsity of the reservoir and input connections, the ranges for random input and
reservoir weights, and the reservoir matrix scaling parameter $\alpha$. A simple,
deterministically constructed cycle reservoir is comparable to the standard echo state
network methodology [32]. The (short-term) memory capacity of linear cyclic
reservoirs can be made arbitrarily close to the proved optimal value.


Problems


11.1 Consider the Mackey-Glass chaotic time series, generated by

$$\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1 + x^{10}(t-\tau)} - b\,x(t)$$

with the parameters $a = 0.2$, $b = 0.1$, and $\tau = 21$.

(a) Design a time-delay neural network to predict 5 samples into the future.
(b) Train the time-delay neural network with 500 points of the time series, and
then use the trained network to predict on the next 200 points. Plot the prediction
results, and give the prediction performance.


11.2 In order to use BPTT, construct a multilayer feedforward network by
unfolding the recurrent network shown in Fig. 11.7.

Figure 11.7 Figure for Problem 11.2.


11.3 Consider the following problem in discrete time:

$$y(k+1) = 0.5\,y(k) + 0.1\,y(k)\left[\sum_{i=0}^{T} y(k-i)\right] + 1.2\,u(k-9)\,u(k) + 0.2,$$

where the random input $u(k)$ is drawn uniformly from $[0, 1]$. Predict the
next output $y(k+1)$.


11.4 An autoregressive AR(4) complex process is given by

$$r(k) = 1.8\,r(k-1) - 2.0\,r(k-2) + 1.2\,r(k-3) - 0.4\,r(k-4) + n(k),$$

where the input $n(k)$ is colored noise. Predict the output using the RBF-AR
model.


Principal component analysis


12.1 Introduction


Most signal-processing problems can be reduced to some form of eigenvalue or 
singular-value problems. EVD and SVD are usually used for solving these prob- 
lems. PCA is a classical statistical method for data analysis [51]. It is related 
to EVD and SVD. Minor component analysis (MCA), as a variant of PCA, is 
most useful for solving the total least squares (TLS) problem. There are many 
neural network models and algorithms for PCA, MCA and SVD. These algo- 
rithms are typically based on unsupervised learning. They significantly reduce 
the cost for adaptive signal, speech, image and video processing, pattern recogni- 
tion, data compression and coding, high-resolution spectrum analysis, and array 
signal processing [27]. 

Stochastic approximation theory [82], first introduced by Robbins and Monro 
in 1951, is now an important tool for analyzing stochastic discrete-time systems 
including the classical gradient-descent method. 

Consider a stochastic discrete-time system of the form

$$\Delta z(t) = z(t+1) - z(t) = \eta(t)\left(f(z, t) + n(t)\right), \tag{12.1}$$


where z is the state vector, f(z,t) is a finite nonzero vector with functions 
as entries, and n(t) is an unbiased noisy term at a particular instant. The 
continuous-time representation is very useful for analyzing the asymptotic behav- 
ior of the algorithm. According to stochastic approximation theory, assuming 
that $\{\eta(t)\}$ is a sequence of positive numbers satisfying the Robbins-Monro
conditions [82]

$$\sum_{t=1}^{\infty} \eta(t) = \infty, \qquad \sum_{t=1}^{\infty} \eta^2(t) < \infty, \tag{12.2}$$

the analysis of the stochastic system (12.1) can be transformed into the analysis 
of a deterministic differential equation 


$$\frac{dz}{dt} = f(z, t). \tag{12.3}$$


If all the trajectories of (12.3) converge to a fixed point $z^*$, the discrete-time
system $z(t) \to z^*$ with probability one as $t \to \infty$. By (12.2), it is required that
$\eta(t) \to 0$ as $t \to \infty$. $\eta(t)$ is typically selected as $\eta(t) = \frac{1}{\alpha + t}$ with a constant $\alpha > 0$,
or as $\eta(t) = t^{-\beta}$ with $\frac{1}{2} < \beta \le 1$ [97], [119].
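
A one-dimensional example may help fix ideas. Take $f(z) = \mu - z$, so that the fixed point of the averaged differential equation is $z^* = \mu$, and let the noise be the deviation of each sample from $\mu$. With $\eta(t) = 1/t$, which satisfies the Robbins-Monro conditions, the recursion reduces to a running average. The values of $\mu$ and the sample count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0              # unknown quantity to be estimated (illustrative value)
z = 0.0               # initial state

for t in range(1, 20001):
    x_t = mu + rng.normal()        # noisy observation: f(z) + noise = x_t - z
    eta = 1.0 / t                  # sum of eta diverges, sum of eta^2 converges
    z += eta * (x_t - z)           # z(t+1) = z(t) + eta(t) (f(z) + n(t))
```

With a constant learning rate the iterate would instead keep fluctuating around $\mu$ with a variance proportional to the rate.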


Hebbian learning rule 


The classical Hebbian synaptic modification rule [47] states that biological synap- 
tic weights change in proportion to the correlation between the pre- and postsy- 
naptic signals. For a single neuron, the Hebbian rule can be written as 


$$w(t+1) = w(t) + \eta\, y(t)\, x_t, \tag{12.4}$$

where the learning rate $\eta > 0$, $w \in R^n$ is the weight vector, $x_t \in R^n$ is an input
vector at time $t$, and $y(t)$ is the neuron output defined by

$$y(t) = w^T(t)\, x_t. \tag{12.5}$$


For a stochastic input vector $x$, assuming that $x$ and $w$ are uncorrelated, the
expected weight change is given by

$$E[\Delta w] = \eta E[y x] = \eta E[x x^T w] = \eta C E[w], \tag{12.6}$$

where $E[\cdot]$ is the expectation operator, and $C = E[x x^T]$ is the autocorrelation
matrix of $x$.

At equilibrium, $E[\Delta w] = 0$; hence, we have a deterministic equation $C w = 0$.
Due to the effect of noise terms, $C$ is a full-rank positive-definite Hermitian
matrix with positive eigenvalues $\lambda_i$, $i = 1, 2, \ldots, n$, and the corresponding orthogonal
eigenvectors $c_i$, where $n = \mathrm{rank}(C)$. Thus, $w = 0$ is the only equilibrium
state.

Equation (12.4) can be represented in continuous-time form

$$\dot{w} = y x. \tag{12.7}$$

Taking statistical averaging, we have

$$E[\dot{w}] = E[y x] = C E[w]. \tag{12.8}$$

This can be derived by minimizing the average instantaneous criterion [44]

$$E[E_{Hebb}] = -\frac{1}{2} E[y^2] = -\frac{1}{2} E[w^T]\, C\, E[w], \tag{12.9}$$

where $E_{Hebb}$ is the instantaneous criterion. At equilibrium,
$E\left[\frac{\partial E_{Hebb}}{\partial w}\right] = -C E[w] = 0$, thus $w = 0$. Since the Hessian
$E[H(w)] = E\left[\frac{\partial^2 E_{Hebb}}{\partial w^2}\right] = -C$ is
nonpositive, the solution $w = 0$ is unstable, which drives $w$ to infinite magnitude
with a direction parallel to that of the eigenvector of $C$ corresponding to
the largest eigenvalue [44].
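
This divergence is easy to observe numerically. The sketch below applies the plain Hebbian rule (12.4) to zero-mean Gaussian samples; the matrix $C$, the learning rate and the initial weight are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0], [1.0, 2.0]])        # autocorrelation matrix (illustrative)
X = rng.multivariate_normal(np.zeros(2), C, size=5000)

eta = 0.002
w = np.array([0.1, -0.1])
norms = []
for x in X:
    y = w @ x
    w = w + eta * y * x                       # Hebbian update (12.4)
    norms.append(np.linalg.norm(w))

c1 = np.linalg.eigh(C)[1][:, -1]              # eigenvector of the largest eigenvalue
cosine = abs(w @ c1) / np.linalg.norm(w)      # alignment of w with c1
```

The norm of `w` grows without bound while its direction settles close to `c1`, which is the behavior that the normalized and weight-decay variants are designed to tame.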

To prevent the divergence of the Hebbian rule, one can normalize $\|w\|$ to unity
after each iteration [116], and this leads to the normalized Hebbian rule. Other
methods such as Oja’s rule [95], Yuille’s rule [151], Linsker’s rule [79, 80], and
Hassoun’s rule [44] add a weight-decay term to the Hebbian rule to stabilize the
algorithm.


Oja’s learning rule 


Oja’s rule introduces a weight-decay term into the Hebbian rule and is given by 
[95] 


$$w(t+1) = w(t) + \eta\, y(t)\, x_t - \eta\, y^2(t)\, w(t). \tag{12.10}$$


Oja’s rule converges to a state that minimizes (12.9) subject to $\|w\| = 1$. The
solution is the principal eigenvector of $C$. For small $\eta$, Oja’s rule is equivalent to
the normalized Hebbian rule [95].

The continuous-time version of Oja’s rule is given by a nonlinear stochastic 
differential equation 


$$\dot{w} = \eta\left(y x - y^2 w\right). \tag{12.11}$$


The corresponding deterministic equation based on statistical averaging is thus 
derived as 


$$\dot{w} = \eta\left[C w - (w^T C w)\, w\right]. \tag{12.12}$$

At equilibrium,

$$C w = (w^T C w)\, w. \tag{12.13}$$

It is easily seen that the solutions are $w = \pm c_i$, $i = 1, 2, \ldots, n$, with the corresponding
eigenvalues $\lambda_i$ arranged in descending order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$.
Notice that the average Hessian

$$H(w) = \frac{\partial}{\partial w}\left[-C w + (w^T C w)\, w\right] = -C + (w^T C w)\, I + 2 w w^T C \tag{12.14}$$

is positive-definite only at $w = \pm c_1$, where $I$ is the $n \times n$ identity matrix, if
$\lambda_1 \ne \lambda_2$ [44]. Thus, Oja’s rule always converges to the principal component of
$C$.
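
A short numerical sketch of Oja’s rule (12.10) illustrates this convergence. The matrix $C$, the constant learning rate and the sample count are illustrative assumptions; a constant rate is used for simplicity, so the weight vector keeps fluctuating slightly around the principal eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0], [1.0, 2.0]])        # autocorrelation matrix (illustrative)
X = rng.multivariate_normal(np.zeros(2), C, size=20000)

eta = 0.002
w = rng.normal(size=2)
w /= np.linalg.norm(w)                        # start on the unit circle
for x in X:
    y = w @ x
    w += eta * (y * x - y**2 * w)             # Hebbian term plus weight decay (12.10)

eigvals, eigvecs = np.linalg.eigh(C)
c1 = eigvecs[:, -1]                           # principal eigenvector of C
alignment = abs(w @ c1)                       # close to 1 when w is near +c1 or -c1
```

Unlike the plain Hebbian rule, the norm of `w` stays close to one without any explicit normalization step.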

The convergence analysis of the stochastic discrete-time algorithms such as 
the gradient-descent method is conventionally based on stochastic approximation 
theory [82]. A stochastic discrete-time algorithm is first converted into determin- 
istic continuous-time ordinary differential equations, and then analyzed by using 
Lyapunov’s second theorem. This conversion is based on the Robbins-Monro 
conditions, which require the learning rate to gradually approach zero as $t \to \infty$.
This limitation is not practical for implementation, especially for learning non- 
stationary data. The stochastic discrete-time algorithms can be converted into 
their deterministic discrete-time formulations that characterize their average evo- 
lution from a conditional expectation perspective [156]. This method has been 
applied to Oja’s rule. Analysis based on this method guarantees the convergence 
of Oja’s rule by selecting some constant learning rate for fast convergence. Oja’s
rule is proved to almost always converge exponentially to the unit eigenvector 
associated with the largest eigenvalue of C, starting from points in an invariant 
set [150]. The initial vectors have been suggested to be selected from the domain 
of a unit hypersphere to guarantee convergence. It is suggested that $\eta = 0.618/\lambda_1$
[150]. 


PCA: conception and model 


PCA is based on the spectral analysis of the second-order moment matrix, called 
the correlation matrix, that statistically characterizes a random vector. In the
zero-mean case, the correlation matrix becomes the covariance matrix. For image 
coding, PCA is known as Karhunen-Loeve transform [83], which is an optimal 
scheme for data compression based on the exploitation of correlation between 
neighboring pixels or groups of pixels. PCA is directly related to SVD, and the 
most common way to perform PCA is via SVD of the data matrix. 
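
The relation between the two routes can be checked directly. The sketch below computes PCA from the SVD of the centered data matrix and compares the result with the eigenvalues of the sample covariance matrix; the synthetic data and the mixing matrix are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing        # rows are observations
Xc = X - X.mean(axis=0)                       # center the data

# Route 1: SVD of the data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (len(Xc) - 1)               # variances along the principal directions

# Route 2: eigenvalues of the sample covariance matrix.
C = np.cov(Xc, rowvar=False)
evd_vals = np.linalg.eigvalsh(C)[::-1]        # sorted in descending order

Y = Xc @ Vt.T                                 # principal components (uncorrelated columns)
```

The SVD route avoids forming the covariance matrix explicitly, which is usually preferable numerically.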

PCA allows the removal of the second-order correlation among given ran- 
dom processes. By calculating the eigenvectors of the covariance matrix of the 
input vector, PCA linearly transforms a high-dimensional input vector into a 
low-dimensional one whose components are uncorrelated. PCA is often based on 
optimization of some information criterion, such as maximization of the variance 
of the projected data or minimization of the reconstruction error. The objective 
of PCA is to extract $m$ orthonormal directions $\hat{w}_i \in R^n$, $i = 1, 2, \ldots, m$, in the
input space that account for as much of the data’s variance as possible. Subsequently,
an input vector $x \in R^n$ may be transformed into a lower $m$-dimensional
space without losing essential intrinsic information. The vector $x$ can be represented
by being projected onto the $m$-dimensional subspace spanned by $\hat{w}_i$ using
the inner products $x^T \hat{w}_i$, yielding dimension reduction.

PCA finds those unit directions $\hat{w} \in R^n$ along which the projections of the
input vectors, known as the principal components, $y = x^T \hat{w}$, have the largest
variance

$$E_{PCA}(w) = E[y^2] = \hat{w}^T C \hat{w} = \frac{w^T C w}{\|w\|^2}, \tag{12.15}$$

where $\hat{w} = w / \|w\|$. $E_{PCA}(w)$ is a positive-semidefinite function. Setting
$\partial E_{PCA} / \partial w = 0$, we get

$$C w = \frac{w^T C w}{\|w\|^2}\, w. \tag{12.16}$$

The solutions to (12.16) are $w = \alpha c_i$, $i = 1, 2, \ldots, n$, where $\alpha \in R$. When $\alpha = 1$,
$w$ becomes a unit vector. In PCA, principal components are sometimes called
factors or latent variables of the data.

We now examine the positive-definiteness of the Hessian of $E_{PCA}(w)$ at $w = c_i$.
Multiplying the Hessian by $c_j$ leads to [44]

$$H(c_i)\, c_j = \begin{cases} 0, & i = j \\ (\lambda_i - \lambda_j)\, c_j, & i \ne j \end{cases} \tag{12.17}$$

Thus, $H(w)$ has the same eigenvectors as $C$ but with different eigenvalues. $H(w)$
is positive-semidefinite only when $w = c_1$. As a result, $w$ will eventually point
in the direction of $c_1$ and $E_{PCA}(w)$ takes its maximum value.

By repeating maximization of $E_{PCA}(w)$ but forcing $w$ orthogonal to $c_1$, the
maximum of $E_{PCA}(w)$ is equal to $\lambda_2$ at $w = \alpha c_2$. Following this deflation procedure,
all the $m$ principal directions $\hat{w}_i$ can be derived [44]. The projections
$y_i = x^T \hat{w}_i$, $i = 1, 2, \ldots, m$, are the principal components of $x$. The result for
two-dimensional input data is illustrated in Fig. 12.1. Each data point is accurately
characterized by its projections on the two principal directions $\hat{w}_1 = c_1 / \|c_1\|$
and $\hat{w}_2 = c_2 / \|c_2\|$. If the data is compressed to one-dimensional space, each data
point is then represented by its projection on eigenvector $\hat{w}_1$.

A linear LS estimate $\hat{x}$ can be constructed for the original input $x$

$$\hat{x} = \sum_{i=1}^{m} y_i \hat{w}_i. \tag{12.18}$$

The reconstruction error $e$ is defined by

$$e = x - \hat{x} = \sum_{i=m+1}^{n} y_i \hat{w}_i. \tag{12.19}$$


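Naturally, $e$ is orthogonal to $\hat{x}$. Each principal component $y_i$ is a Gaussian with
zero mean and variance $\sigma_i^2 = \lambda_i$. The variances of $x$, $\hat{x}$ and $e$ can be, respectively,
expressed as

$$E\left[\|x\|^2\right] = \sum_{i=1}^{n} \sigma_i^2 = \sum_{i=1}^{n} \lambda_i, \tag{12.20}$$

$$E\left[\|\hat{x}\|^2\right] = \sum_{i=1}^{m} \sigma_i^2 = \sum_{i=1}^{m} \lambda_i, \tag{12.21}$$

$$E\left[\|e\|^2\right] = \sum_{i=m+1}^{n} \sigma_i^2 = \sum_{i=m+1}^{n} \lambda_i. \tag{12.22}$$

When we use only the first $m_1$ among the extracted $m$ principal components to
represent the raw data, we need to evaluate the error by replacing $m$ by $m_1$.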
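
These variance relations can be verified on sample data. In the sketch below the expectation is replaced by a sample average and $C$ by the sample second-moment matrix, so that the identities hold up to rounding; the dimensions and the random mixing matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2                                   # input dimension, retained components
A = rng.normal(size=(n, n))                   # arbitrary mixing matrix
X = rng.normal(size=(5000, n)) @ A.T
X -= X.mean(axis=0)                           # zero-mean data

C = X.T @ X / len(X)                          # sample second-moment matrix
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]                # descending eigenvalues

W = V[:, :m]                                  # first m principal directions
Y = X @ W                                     # principal components y_i
X_hat = Y @ W.T                               # linear LS estimate of x
E = X - X_hat                                 # reconstruction error e

rec_var = np.mean(np.sum(X_hat**2, axis=1))   # sample version of E[||x_hat||^2]
err_var = np.mean(np.sum(E**2, axis=1))       # sample version of E[||e||^2]
```

Comparing `err_var` with the sum of the discarded eigenvalues gives a direct way to choose how many components to retain.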

Neural PCA originates from the seminal work by Oja [95]. Oja’s single-neuron
PCA model is illustrated in Fig. 12.2. The output of the neuron is updated by

$$y = w^T x, \tag{12.23}$$

where $w = (w_1, \ldots, w_{J_1})^T$. Notice that the activation function is the linear function
$\phi(x) = x$.

Figure 12.1 Illustration of PCA in two dimensions.

Figure 12.2 The single-neuron PCA model extracts the first principal component of C.

Figure 12.3 Architecture of the PCA network.


The PCA network model was first proposed by Oja [97], where a $J_1$-$J_2$ feedforward
network, as shown in Fig. 12.3, is used to extract the first $J_2$ principal
components. The architecture of the PCA network is a simple expansion of the
single-neuron PCA model. The output of the network is given by

$$y = W^T x, \tag{12.24}$$

where $y = (y_1, y_2, \ldots, y_{J_2})^T$, $x = (x_1, x_2, \ldots, x_{J_1})^T$, $W = [w_1, w_2, \ldots, w_{J_2}]$,
and $w_i = (w_{1i}, w_{2i}, \ldots, w_{J_1 i})^T$.


12.2.1 Factor analysis 


Factor analysis is a powerful multivariate analysis technique that identifies the 
common characteristics among a set of variables and has been widely used in 
many disciplines such as botany, biology, social sciences, economics and engineering.
Factor analysis is a linear latent variable scheme which is also used to
capture local substructures. It models correlations in multidimensional data by 
expressing the correlations in a lower dimensional subspace, hence reducing the 
attribute space from a larger number of observable variables to a smaller number 
of latent variables called factors. 

Statistical factor analysis is given by a common latent variable model [126] 


$$x = A y + \mu + n, \tag{12.25}$$

where the observed vector $x$ is a $J_1$-dimensional vector, the parameter matrix
$A = [a_{ij}]$ is a $J_1 \times J_2$ matrix (with $J_2 < J_1$) containing the factor loadings $a_{ij}$,
the latent variable vector $y$ is a $J_2$-dimensional vector whose entries are termed
common factors, factors or latent variables, $\mu$ is a mean vector, and $n$ is a noise
term. The entries in $\mu$ and $n$ are known as specific factors. When $y$ and $n$ are
Gaussian, $x$ also has a normal distribution.
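
The generative reading of (12.25) can be sketched by sampling from it. Two assumptions are added here that are standard in factor analysis but not stated above: the factors are drawn as $y \sim N(0, I)$, and the noise has a diagonal covariance $\Psi$. Under these assumptions the covariance of $x$ is $A A^T + \Psi$. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
J1, J2, n_samples = 5, 2, 200000

A = rng.uniform(-1.0, 1.0, size=(J1, J2))     # factor loadings (illustrative values)
mu = rng.normal(size=J1)                      # mean vector
psi = rng.uniform(0.1, 0.5, size=J1)          # assumed diagonal noise variances

Y = rng.normal(size=(n_samples, J2))                  # common factors, y ~ N(0, I)
N = rng.normal(size=(n_samples, J1)) * np.sqrt(psi)   # noise term
X = Y @ A.T + mu + N                                  # observations, x = A y + mu + n

model_cov = A @ A.T + np.diag(psi)            # covariance implied by the model
sample_cov = np.cov(X, rowvar=False)
```

Replacing `A` by `A @ R` for any orthogonal `R` leaves `model_cov` unchanged, which is the rotational ambiguity of the estimated factors.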

Factor analysis is related to PCA. It can be seen that PCA has exactly the
same probabilistic linear model as (12.25). However, the factors estimated are
not defined uniquely, but only up to a rotation. The ML method for fitting factor
analysis is very popular, but EM has slow convergence.
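
The rotation ambiguity of the factors can be made explicit with a short worked identity. The sketch assumes, beyond what is stated in the text, that $\mathbf{y}$ is zero-mean with identity covariance, that $\mathbf{n}$ is independent of $\mathbf{y}$ with covariance $\boldsymbol{\Psi}$ (a symbol introduced here only for illustration), and that $\mathbf{R}$ is any $J_2 \times J_2$ orthogonal matrix.

```latex
\mathbf{x} = \mathbf{A}\mathbf{y} + \boldsymbol{\mu} + \mathbf{n}
           = (\mathbf{A}\mathbf{R})(\mathbf{R}^T\mathbf{y}) + \boldsymbol{\mu} + \mathbf{n},
\qquad \mathbf{R}\mathbf{R}^T = \mathbf{I},
\\
\mathrm{Cov}(\mathbf{x}) = \mathbf{A}\mathbf{A}^T + \boldsymbol{\Psi}
           = (\mathbf{A}\mathbf{R})(\mathbf{A}\mathbf{R})^T + \boldsymbol{\Psi}.
```

Under these assumptions $\mathbf{R}^T\mathbf{y}$ has the same mean and covariance as $\mathbf{y}$, so the loadings $\mathbf{A}$ and $\mathbf{A}\mathbf{R}$ give the same second-order statistics of $\mathbf{x}$ and cannot be distinguished from them.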


Hebbian rule based PCA 


PCA maximizes the output variances $E[y_i^2] = E[(\mathbf{w}_i^T \mathbf{x})^2] = \mathbf{w}_i^T \mathbf{C} \mathbf{w}_i$ of the
linear network under orthonormality constraints. In the hierarchical case, the
constraints take the form $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$, $j \le i$, $\delta_{ij}$ being the Kronecker delta.
In the symmetric case, symmetric orthonormality constraints $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$ are
applied. The subspace learning algorithm (SLA) and the generalized Hebbian
algorithm (GHA) correspond to the symmetric and hierarchical network
structures, respectively.


Subspace learning algorithms 


By using Oja’s rule (12.10), w will converge to a unit eigenvector of the corre- 
lation matrix C, and the variance of the output y is maximized. For zero-mean 
input data, this extracts the first principal component [95]. We rewrite (12.10) 
for convenience of presentation 


$$\mathbf{w}(t+1) = \mathbf{w}(t) + \eta \left[ y(t)\mathbf{x}_t - y^2(t)\mathbf{w}(t) \right], \tag{12.26}$$


where $y(t)\mathbf{x}_t$ is the Hebbian term, and $-y^2(t)\mathbf{w}(t)$ is a decaying term, which is
used to prevent instability. In order to keep the algorithm convergent,
$0 < \eta(t) < \frac{1}{1.2\lambda_1}$ is required [97], where $\lambda_1$ is the largest eigenvalue of $\mathbf{C}$. If
$\eta(t) \ge \frac{1}{\lambda_1}$, $\mathbf{w}$ will not converge to $\pm\mathbf{c}_1$ [12]. One can select
$\eta(t) = 0.5 \left[ \mathbf{x}_t^T \mathbf{x}_t \right]^{-1}$ at the beginning and gradually decrease it [97].
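
A minimal NumPy sketch of the single-neuron rule (12.26) may help to fix ideas. The function name `oja_rule`, the random unit-norm initialization, and the decay factor `1 / (1 + t / t_decay)` applied to the initial choice $0.5 / (\mathbf{x}_t^T \mathbf{x}_t)$ are assumptions made for illustration; they are not prescribed by [97].

```python
import numpy as np


def oja_rule(X, n_epochs=5, t_decay=100.0, seed=1):
    """Single-neuron Oja rule (12.26): w <- w + eta * (y * x - y^2 * w).

    X holds one zero-mean sample per row.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)                     # start from a random unit vector
    t = 0
    for _ in range(n_epochs):
        for x in X:
            # Assumed schedule: 0.5 / (x^T x) at first, then gradually decreased.
            eta = 0.5 / (x @ x) / (1.0 + t / t_decay)
            y = w @ x                          # neuron output
            w += eta * (y * x - y * y * w)     # Hebbian term plus decaying term
            t += 1
    return w
```

For zero-mean data with a clear gap between the two largest eigenvalues, the returned vector is expected to align, up to sign, with the first principal eigenvector and to have a norm close to unity.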
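Symmetrical SLA for the PCA network [97] can be derived by maximizing

$$E_{\mathrm{SLA}} = \frac{1}{2} \mathrm{tr}\left( \mathbf{W}^T \mathbf{C} \mathbf{W} \right) \tag{12.27}$$

subject to

$$\mathbf{W}^T \mathbf{W} = \mathbf{I}, \tag{12.28}$$


where $\mathbf{I}$ is the $J_2 \times J_2$ identity matrix.
SLA is given by [97]


$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta(t) y_i(t) \left[ \mathbf{x}_t - \hat{\mathbf{x}}_t \right], \tag{12.29}$$


$$\hat{\mathbf{x}}_t = \mathbf{W}\mathbf{y}. \tag{12.30}$$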


After the algorithm converges, $\mathbf{W}$ is roughly orthonormal and the columns of
$\mathbf{W}$, namely, $\mathbf{w}_i$, $i = 1, \ldots, J_2$, converge to some linear combination of the first $J_2$
principal eigenvectors of $\mathbf{C}$ [97], which is a rotated basis of the dominant eigenvector
subspace. This analysis is called the principal subspace analysis (PSA).
The value of $\mathbf{w}_i$ is dependent on the initial conditions and training samples.
The corresponding eigenvalues $\lambda_i$, $i = 1, \ldots, J_2$, approximate $E[y_i^2]$, which
can be adaptively estimated by

$$\hat{\lambda}_i(t+1) = \left( 1 - \frac{1}{t+1} \right) \hat{\lambda}_i(t) + \frac{1}{t+1} y_i^2(t+1). \tag{12.31}$$


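Weighted SLA can be derived by maximizing the same criterion (12.27), but
the constraint (12.28) can be modified as [98]


$$\mathbf{W}^T \mathbf{W} = \boldsymbol{\alpha}, \tag{12.32}$$

where $\boldsymbol{\alpha} = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_{J_2})$ is an arbitrary diagonal matrix with
$\alpha_1 > \alpha_2 > \cdots > \alpha_{J_2} > 0$.

Weighted SLA is given by [96, 98]

$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta(t) y_i(t) \left[ \mathbf{x}_t - \gamma_i \hat{\mathbf{x}}_t \right], \quad i = 1, \ldots, J_2, \tag{12.33}$$

$$\hat{\mathbf{x}}_t = \mathbf{W}\mathbf{y}, \tag{12.34}$$

where $\gamma_i$, $i = 1, \ldots, J_2$, are any coefficients that satisfy $0 < \gamma_1 < \gamma_2 < \cdots < \gamma_{J_2}$.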


Due to the asymmetry introduced by $\gamma_i$, $\mathbf{w}_i$ almost surely converges to the
eigenvectors of $\mathbf{C}$. Weighted SLA can perform PCA; however, norms of the weight
vectors are not equal to unity.

SLA and weighted SLA are nonlocal algorithms that rely on the calculation 
of the errors and the backward propagation of the values between the layers. 
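
The two subspace rules differ only in the coefficients $\gamma_i$, so a single sketch can cover both. The following NumPy code is an illustration under stated assumptions: the function name `weighted_sla`, the small random initialization, and the step-size schedule `eta0 / (1 + t / t_decay)` are hypothetical choices, not those of [96, 98].

```python
import numpy as np


def weighted_sla(X, n_components, gamma=None, eta0=0.01, t_decay=5000.0,
                 n_epochs=5, seed=1):
    """Weighted SLA, (12.33)-(12.34); gamma=None gives plain SLA, (12.29)-(12.30).

    X holds one sample per row; the columns of the returned W are the w_i.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], n_components))
    if gamma is None:
        gamma = np.ones(n_components)          # all gamma_i = 1 reduces to SLA
    gamma = np.asarray(gamma, dtype=float)
    t = 0
    for _ in range(n_epochs):
        for x in X:
            eta = eta0 / (1.0 + t / t_decay)   # assumed decaying schedule
            y = W.T @ x                        # network output, (12.24)
            x_hat = W @ y                      # reconstruction, (12.30)/(12.34)
            # Column i receives eta * y_i * (x - gamma_i * x_hat).
            W += eta * (np.outer(x, y) - np.outer(x_hat, gamma * y))
            t += 1
    return W
```

With increasing $\gamma_i$ the columns are expected to settle on individual eigenvectors with norms near $1/\sqrt{\gamma_i}$, whereas with all $\gamma_i = 1$ they only form a roughly orthonormal basis of the principal subspace.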

By adding one more term to the PSA algorithm, a PCA algorithm can be
obtained [56]. This additional term rotates the basis vectors in the principal
subspace toward the principal eigenvectors. PCA derived from SLA is given as
[56]

$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta(t) y_i(t) \left[ \mathbf{x}_t - \hat{\mathbf{x}}_t \right] \tag{12.35}$$
$$\qquad\qquad + \, \eta(t) \rho_i \left( y_i(t)\mathbf{x}_t - \mathbf{w}_i(t) y_i^2(t) \right), \tag{12.36}$$


where $1 > |\rho_1| > |\rho_2| > \cdots > |\rho_{J_2}|$. This PCA algorithm generates weight vectors
of unit length.
The adaptive learning algorithm [12] is a PCA algorithm based on SLA. In
this method, each neuron adaptively updates its learning rate by

$$\eta_i(t) = \frac{\beta_i(t)}{\hat{\lambda}_i(t)}, \tag{12.37}$$

where $\hat{\lambda}_i(t)$ is the estimated eigenvalue, which can be estimated using (12.31),
$\beta_i(t)$ is set to be smaller than $2(\sqrt{2} - 1)$ and decreases to zero as $t \to \infty$. If $\beta_i(t)$
is the same for all $i$, $\mathbf{w}_i(t)$ will quickly converge, at nearly the same rate, to $\mathbf{c}_i$
for all $i$ in an order of descending eigenvalues. The adaptive learning algorithm
[12] converges to the desired target both in the large eigenvalue case as well as
in the small eigenvalue case, with performance better than that of GHA [119].

The modulated Hebbian rule [55] is a biologically inspired PSA. In order to
achieve orthonormality, the modulated Hebb-Oja rule [55] performs Oja’s rule on
separate weight vectors with only one difference: the learning factor $\eta$ is specifically
programmed by the network. In this case $\eta$ decreases as time approaches infinity.
Unlike some other recursive PCA/PSA methods that use local feedback connections
in order to maintain stability, the modulated Hebb-Oja rule uses a global
feedback connection. The number of global calculation circuits is 2 in the modulated
Hebb-Oja algorithm and $J_1$ in SLA. The number of calculations required for SLA is
lower than, but very close to, the number of calculations required for modulated
Hebb-Oja. For the modulated Hebb-Oja algorithm, we have [57]


$$\Delta w_{ki} = \tilde{\eta} \left( x_k y_i - w_{ki} y_i^2 \right), \tag{12.38}$$
$$\tilde{\eta} = \eta \left( \mathbf{x}^T \mathbf{x} - \mathbf{y}^T \mathbf{y} \right), \tag{12.39}$$
$$\mathbf{y} = \mathbf{W}^T \mathbf{x}. \tag{12.40}$$


For $J_1$ inputs and $J_2$ outputs, $J_2 < J_1$. The modulated Hebb-Oja algorithm is
given by


$$\mathbf{W}(t+1) = \mathbf{W}(t) + \eta(t) \left( \mathbf{x}_t^T \mathbf{x}_t - \mathbf{y}^T(t)\mathbf{y}(t) \right) \left( \mathbf{x}_t \mathbf{y}^T(t) - \mathbf{W}(t)\,\mathrm{diag}\left( \mathbf{y}(t)\mathbf{y}^T(t) \right) \right). \tag{12.41}$$
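
The matrix form (12.41) is simply the element-wise rule (12.38) to (12.40) applied to every weight at once. A minimal sketch of a single update step is given below; the function name `modulated_hebb_oja_step` and the choice of passing the base learning factor as an argument are assumptions for illustration.

```python
import numpy as np


def modulated_hebb_oja_step(W, x, eta):
    """One update of (12.41) for a J1 x J2 weight matrix W and an input vector x."""
    y = W.T @ x                                # network output, (12.40)
    eta_mod = eta * (x @ x - y @ y)            # modulated learning factor, (12.39)
    # W * (y * y) scales column i by y_i^2, which equals W @ diag(y y^T).
    return W + eta_mod * (np.outer(x, y) - W * (y * y))
```

Note that the modulating factor vanishes whenever the output energy equals the input energy, for instance when the columns of W are orthonormal and x lies in their span, so such an input leaves the weights unchanged.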


Example 12.1: The concept of subspace is involved in many signal-processing 
problems. This requires the EVD of the autocorrelation matrix of a data set or 
the SVD of the crosscorrelation matrix of two data sets. This example illustrates 
the use of weighted SLA for extracting the multiple principal components.

Figure 12.4 The benchmark data set.


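Given a data set of 2000 vectors $\{\mathbf{x}_p \in R^3\}$, where $\mathbf{x}_p = (x_{p,1}, x_{p,2}, x_{p,3})^T$.
We take $x_{p,1} = 1 + 0.5N(0,1)$, $x_{p,2} = (-1 + 2N(0,1))x_{p,1}$, and $x_{p,3} = (1 + 2N(0,1))x_{p,2}$,
where $N(0,1)$ denotes a Gaussian noise term with zero-mean and
unit variance. The autocorrelation matrix is calculated as $\mathbf{C} = \frac{1}{2000} \sum_{p=1}^{2000} \mathbf{x}_p \mathbf{x}_p^T$.
The data set is shown in Fig. 12.4. Applying SVD, we get three eigenvectors and
their eigenvalues in a descending order:


$$\mathbf{w}_1 = (0.0277, -0.0432, 0.9987)^T, \quad \|\mathbf{w}_1\| = 1.0000, \quad \lambda_1 = 86.3399,$$
$$\mathbf{w}_2 = (-0.2180, 0.9748, 0.0482)^T, \quad \|\mathbf{w}_2\| = 1.0000, \quad \lambda_2 = 22.9053,$$
$$\mathbf{w}_3 = (0.9756, 0.2191, -0.0176)^T, \quad \|\mathbf{w}_3\| = 1.0000, \quad \lambda_3 = 3.2122.$$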


In this example, the simulation results slightly deviate from these values since 
we use only 2000 samples. 

For weighted SLA, we select $\boldsymbol{\gamma} = (1, 2, 3)^T$. We select a time-dependent learning rate
$\eta_t$, where each time $t$ corresponds to the presentation of a new sample.
Training is performed for 10 epochs, and the training samples are provided in a 
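fixed deterministic sequence. We calculate the adaptations for $\|\mathbf{w}_i\|$, $\lambda_i$, and the
cosine of the angle $\theta_i$ between $\mathbf{w}_i$ and $\mathbf{c}_i$:


$$\lambda_1 = 84.2226, \quad \lambda_2 = 20.5520, \quad \lambda_3 = 3.8760.$$


$$\cos\theta_1 = 1.0000, \quad \cos\theta_2 = -1.0000, \quad \cos\theta_3 = -1.0000.$$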


The adaptations for a random run are shown in Fig. 12.5. 
We selected y; = 0.17, i = 1, 2,3. The weight vectors converge to the directions 
or directions opposite to those of the principal eigenvectors, that is, $\cos\theta_i \to \pm 1$
as $t \to \infty$. The converged $\lambda_i$ and $\|\mathbf{w}_i\|$ do not reach their respective
theoretical values.

Figure 12.5 Adaptation of weighted SLA. (a) $\cos\theta_i$. (b) $\|\mathbf{w}_i\|$. (c) $\lambda_i$. (d) $\lambda_i$ obtained after
normalization. The dotted lines correspond to the theoretical eigenvalues.


From Fig. 12.5, we see that the convergence to smaller eigenvalues is slow.
The norms $\|\mathbf{w}_i\|$ do not converge to unity, and accordingly the $\lambda_i$ do not converge to their
theoretical values. By applying a normalization step at each iteration, we normalize
$\|\mathbf{w}_i\|$; the resulting normalized $\lambda_i$ is plotted in Fig. 12.5d.

For weighted SLA, we also test for different $\gamma_i$. When all $\gamma_i$ are selected as
unity, weighted SLA reduces to SLA, and $\cos\theta_i$ could be any value in $[-1, +1]$.
In the case of SLA, the norms $\|\mathbf{w}_i\|$ converge to unity very rapidly, but the $\lambda_i$ converge to some
values different from their theoretical eigenvalues and also not in a descending
order.


12.3.2 Generalized Hebbian algorithm


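By combining Oja’s rule and the GSO procedure, Sanger proposed GHA for
extracting the first $J_2$ principal components [119]. GHA can extract the first $J_2$
eigenvectors in an order of decreasing eigenvalues.

GHA is given by [119]

$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta_i(t) y_i(t) \left[ \mathbf{x}_t - \hat{\mathbf{x}}_i(t) \right], \tag{12.42}$$

$$\hat{\mathbf{x}}_i(t) = \sum_{j=1}^{i} \mathbf{w}_j(t) y_j(t). \tag{12.43}$$

GHA becomes a local algorithm by solving the summation term in (12.43) in a
recursive form

$$\hat{\mathbf{x}}_i(t) = \hat{\mathbf{x}}_{i-1}(t) + \mathbf{w}_i(t) y_i(t), \tag{12.44}$$

where $\hat{\mathbf{x}}_0(t) = \mathbf{0}$. Usually, one selects $\eta_i = \eta$ for all neurons $i$, and accordingly
the algorithm can be written in matrix form


$$\mathbf{W}(t+1) = \mathbf{W}(t) - \eta \mathbf{W}(t)\,\mathrm{LT}\left[ \mathbf{y}(t)\mathbf{y}^T(t) \right] + \eta \mathbf{x}_t \mathbf{y}^T(t), \tag{12.45}$$


where operator $\mathrm{LT}[\cdot]$ selects the lower triangle of the matrix contained within.
In GHA, the $m$th neuron converges to the $m$th principal component, and all the
neurons tend to converge together. $\mathbf{w}_i \to \mathbf{c}_i$ and $E[y_i^2] \to \lambda_i$, as $t \to \infty$.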

Both SLA and GHA employ implicit or explicit GSO to decorrelate the con- 
nection weights from one another. Weighted SLA performs well for extracting 
less-dominant components [96]. 
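
A minimal NumPy sketch of GHA in the spirit of Sanger's rule is given below. It stores the weight vectors as the columns of W, as in (12.24); under this storage convention the sum over neurons $j \le i$ corresponds to keeping the upper triangle (diagonal included) of $\mathbf{y}\mathbf{y}^T$, which is why the sketch calls `np.triu`. The function name `gha`, the small random initialization, and the decaying step-size schedule are assumptions made for illustration.

```python
import numpy as np


def gha(X, n_components, eta0=0.01, t_decay=5000.0, n_epochs=5, seed=1):
    """Generalized Hebbian algorithm with a common learning rate for all neurons.

    X holds one zero-mean sample per row; the columns of the returned W are the w_i.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], n_components))
    t = 0
    for _ in range(n_epochs):
        for x in X:
            eta = eta0 / (1.0 + t / t_decay)   # assumed decaying schedule
            y = W.T @ x
            # Column i of W @ triu(y y^T) equals y_i * sum_{j<=i} w_j y_j,
            # so neuron i only sees the reconstruction from neurons 1..i.
            W += eta * (np.outer(x, y) - W @ np.triu(np.outer(y, y)))
            t += 1
    return W
```

After training, `np.mean((X @ W) ** 2, axis=0)` gives a simple estimate of the eigenvalues from the output variances.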

Traditionally, the learning rates of GHA are required to satisfy the Robbins-Monro
conditions so that its convergence can be analyzed by studying the corresponding
deterministic continuous-time equations. Based on analyzing the corresponding
deterministic discrete-time equations, the global convergence of GHA
is guaranteed by using the adaptive learning rates [86]


$$\eta_i(k) = \frac{\zeta}{y_i^2(k)}, \quad k > 0, \tag{12.46}$$

where the constant $0 < \zeta < 1$. These learning rates converge to some positive
constants, which speed up the algorithm considerably and also enable the convergence
speed in all eigendirections to be approximately equal.

In addition to the popular SLA, weighted SLA and GHA, there are also some
other Hebbian rule based PCA algorithms such as LEAP [11]. LEAP [11] is a
local PCA algorithm for extracting all the $J_2$ principal components and their
corresponding eigenvectors. It performs GSO among all weights at each iteration.
Unlike SLA and GHA, whose stability analyses are based on the stochastic
approximation theory, the stability analysis of LEAP is based on Lyapunov’s
first theorem, and $\eta$ can be selected as a small positive constant. LEAP is capable
of tracking nonstationary processes, and can satisfactorily extract principal
components even for ill-conditioned autocorrelation matrices [11].


Example 12.2: Using the same data set given in Example 12.1, we conduct simulation
for GHA. We select the same learning rate $\eta_t$ as in Example 12.1, where each time
$t$ corresponds to the presentation of a new sample. The data set is repeated 10
times, and the training samples are provided in a fixed deterministic sequence.
We calculate the adaptations for $\|\mathbf{w}_i\|$, $\lambda_i$, and the cosine of the angle $\theta_i$ between
$\mathbf{w}_i$ and $\mathbf{c}_i$.

Figure 12.6 Adaptation of GHA. (a) $\cos\theta_i$. (b) $\|\mathbf{w}_i\|$. (c) $\lambda_i$. The dotted lines in (c) correspond to the
theoretical eigenvalues.

The performances of the algorithms are evaluated by averaging 50 random
runs. The adaptations for a random run are shown in Fig. 12.6. Our empirical
results show that for GHA a larger starting $\eta$ can be used.

From Fig. 12.6, we see that the convergence to smaller eigenvalues is slow. The
strategy used in the adaptive learning algorithm [12] can be applied to make
the algorithms converge to all the eigenvalues at the same speed. $\mathbf{w}_i$ converges
to the directions of $\mathbf{c}_i$. Unlike weighted SLA, all $\|\mathbf{w}_i\|$ converge to unity, and
accordingly the $\lambda_i$ converge to their theoretical values.


12.4 Least mean squared error-based PCA 
Existing PCA algorithms including Hebbian rule based algorithms can be derived
by optimizing an objective function using the gradient-descent method. The least
mean squared error (LMSE)-based methods are derived from the modified MSE
function

$$E(\mathbf{W}) = \sum_{t_1=1}^{t} \mu^{t-t_1} \left\| \mathbf{x}_{t_1} - \mathbf{W}\mathbf{W}^T \mathbf{x}_{t_1} \right\|^2, \tag{12.47}$$

where $0 < \mu < 1$ is a forgetting factor used for nonstationary observation
sequences, and $t$ is the current instant. Many adaptive PCA algorithms actually
optimize (12.47) by using the gradient-descent method [137, 141] and the
RLS method [4, 141, 90, 100, 102].
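
The criterion (12.47) is easy to evaluate directly, which is useful for checking any of the adaptive algorithms against a batch solution. The sketch below is an illustration only; the function name `lmse_cost` and the convention that the rows of X are ordered in time, with the last row being the current sample, are assumptions.

```python
import numpy as np


def lmse_cost(W, X, mu=1.0):
    """Modified MSE (12.47) for a J1 x J2 matrix W and samples x_1..x_t as rows of X."""
    t = X.shape[0]
    weights = mu ** np.arange(t - 1, -1, -1)   # mu^(t - t1) for t1 = 1, ..., t
    residual = X - X @ W @ W.T                 # row t1 holds x - W W^T x
    return float(weights @ np.sum(residual ** 2, axis=1))
```

For a W with orthonormal columns the cost equals the trace of the exponentially weighted autocorrelation matrix minus the trace of its projection onto the columns of W, so it is smallest when W spans the principal subspace of that weighted matrix.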

The gradient-descent or Hebbian rule based algorithms are highly sensitive to
parameters such as $\eta$. It is difficult to choose proper parameters guaranteeing
both small misadjustment and fast convergence. To overcome these drawbacks,
applying RLS to minimization of (12.47) yields RLS-based algorithms such as
adaptive principal components extraction (APEX) [73], Kalman-type RLS [4],
projection approximation subspace tracking (PAST) [141], PAST with deflation
(PASTd) [141], and robust RLS algorithm [100].

All RLS-based PCA algorithms exhibit fast convergence and high tracking
accuracy, and are suitable for slowly varying nonstationary vector stochastic processes.
All these algorithms correspond to a three-layer $J_1$-$J_2$-$J_1$ linear autoassociative
network model, and they can extract all the $J_2$ principal components
in a descending order of the eigenvalues, where a GSO-like orthonormalization
procedure is used.

In [77], a regularization term $\mu^t \bar{\mathbf{w}}^T \mathbf{P}_0^{-1} \bar{\mathbf{w}}$ is added to (12.47), where $\bar{\mathbf{w}}$ is a
stack vector of $\mathbf{W}$ and $\mathbf{P}_0$ is a diagonal $J_1 J_2 \times J_1 J_2$ matrix. As $t$ is sufficiently
large, this term is negligible. This term ensures that the entries of $\mathbf{W}$ do not
become too large. Without this term, some matrices in the recursive updating
equations may become indefinite. Gauss-Seidel recursive PCA and Jacobi recursive
PCA are derived in [77].

The least mean squared error reconstruction (LMSER) algorithm [137] is
derived on the MSE criterion using the gradient-descent method. LMSER reduces
to Oja’s algorithm when $\mathbf{W}(t)$ is orthonormal, namely, $\mathbf{W}^T(t)\mathbf{W}(t) = \mathbf{I}$. In this
sense, Oja’s algorithm can be treated as an approximate stochastic gradient rule
to minimize the MSE. LMSER has been compared with weighted SLA and GHA
in [8]. The learning rates for all the algorithms are selected as $\eta(t) = \frac{\delta}{t^{\alpha}}$, where
$\delta > 0$ and $\frac{1}{2} < \alpha \le 1$. A tradeoff is obtained: Increasing the values of $\gamma$ and $\delta$
results in a larger asymptotic MSE but faster convergence and vice versa, namely,
the stability-speed problem. LMSER uses nearly twice as much computation as
weighted SLA and GHA, for each update of the weight. However, it leads to a
smaller asymptotic MSE and faster convergence for the minor eigenvectors [8].


PASTd algorithm 

PASTd [141] is a well-known subspace tracking algorithm updating the signal
eigenvectors and eigenvalues. PASTd is based on PAST. Both PAST and PASTd
are derived for complex-valued signals, which are common in signal processing.
At iteration $t$, PASTd is given as follows [141]: for $i = 1, \ldots, J_2$,

$$y_i(t) = \mathbf{w}_i^H(t-1)\mathbf{x}_i(t), \tag{12.48}$$

$$\delta_i(t) = \mu\delta_i(t-1) + |y_i(t)|^2, \tag{12.49}$$

$$\hat{\mathbf{x}}_i(t) = \mathbf{w}_i(t-1) y_i(t), \tag{12.50}$$

$$\mathbf{w}_i(t) = \mathbf{w}_i(t-1) + \left[ \mathbf{x}_i(t) - \hat{\mathbf{x}}_i(t) \right] \frac{y_i^*(t)}{\delta_i(t)}, \tag{12.51}$$

$$\mathbf{x}_{i+1}(t) = \mathbf{x}_i(t) - \mathbf{w}_i(t) y_i(t), \tag{12.52}$$


where $\mathbf{x}_1(t) = \mathbf{x}_t$, and superscript $*$ denotes the conjugate operator.

$\mathbf{w}_i(0)$ and $\delta_i(0)$ should be suitably selected. $\mathbf{W}(0)$ should contain $J_2$ orthonormal
vectors, which can be calculated from an initial block of data or from arbitrary
initial data. A simple way is to set $\mathbf{W}(0)$ to the $J_2$ leading unit vectors of
the $J_1 \times J_1$ identity matrix. $\delta_i(0)$ can be set as unity. The choice of these initial
values affects the transient behavior, but not the steady-state performance of
the algorithm. $\mathbf{w}_i(t)$ provides an estimate of the $i$th eigenvector, and $\delta_i(t)$ is an
exponentially weighted estimate of the corresponding eigenvalue.

Both PAST and PASTd have linear computational complexity, that is,
$O(J_1 J_2)$ operations every update, as in the cases of SLA, GHA, LMSER, and the
novel information criterion (NIC) algorithm [90]. PAST computes an arbitrary 
basis of the signal subspace, while PASTd is able to update the signal eigen- 
vectors and eigenvalues. Both the algorithms produce nearly orthonormal, but 
not exactly orthonormal, subspace basis or eigenvector estimates. If perfectly 
orthonormal eigenvector estimates are required, an orthonormalization proce- 
dure is necessary. 
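
The PASTd recursion maps almost line by line onto code. The sketch below is restricted to real-valued data, so the Hermitian transpose and the conjugate reduce to the ordinary transpose and the identity; this restriction, the function name `pastd`, and the default `mu=1.0` (no forgetting) are assumptions made for illustration.

```python
import numpy as np


def pastd(X, n_components, mu=1.0):
    """Real-valued PASTd, (12.48)-(12.52); X holds one sample per row.

    Returns the weight matrix W (columns w_i) and the accumulators delta_i.
    """
    J1 = X.shape[1]
    W = np.eye(J1)[:, :n_components].copy()    # leading unit vectors as W(0)
    delta = np.ones(n_components)              # delta_i(0) = 1
    for x in X:
        x_i = np.array(x, dtype=float)         # x_1(t) = x_t, working copy
        for i in range(n_components):
            y = W[:, i] @ x_i                            # (12.48)
            delta[i] = mu * delta[i] + y * y             # (12.49)
            x_hat = W[:, i] * y                          # (12.50)
            W[:, i] += (x_i - x_hat) * (y / delta[i])    # (12.51)
            x_i = x_i - W[:, i] * y                      # (12.52), deflation
    return W, delta
```

With `mu=1.0` the ratio `delta / len(X)` serves as an eigenvalue estimate; with a forgetting factor below one, `(1 - mu) * delta` plays the same role once the transient has died out.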

Kalman-type RLS [4] combines RLS with the GSO procedure in a manner
similar to that of GHA. Kalman-type RLS and PASTd are exactly identical if
the inverse of the covariance of the $i$th neuron’s output, $P_i(t)$, in Kalman-type
RLS is set as $\frac{1}{\delta_i(t)}$ in PASTd. In the one-unit case, both PAST and PASTd
are identical to Oja’s rule except that PAST and PASTd have a self-tuning
learning rate $\frac{1}{\delta_1(t)}$. Both PAST and PASTd provide much more robust estimates
than EVD, and converge much faster than SLA. PASTd has been extended for
tracking both the rank and the subspace by using information theoretic criteria
such as AIC and MDL [142].

The constrained PAST algorithm [130] is for tracking the signal subspace 
recursively. Based on an interpretation of the signal subspace as the solution of a 
constrained minimization task, it guarantees the orthonormality of the estimated 
signal subspace basis at each update, hence avoiding the orthonormalization 
process. To reduce the computational complexity, fast constrained PAST that
has a complexity of $O(J_1 J_2)$ is introduced. For tracking the signal sources with
abrupt change in their parameters, an alternative implementation with truncated
window is proposed. Furthermore, a signal subspace rank estimator is employed
to track the number of sources.

A perturbation-based fixed-point algorithm for subspace tracking [48] has fast
tracking capability due to the recursive nature of the complete eigenvector matrix
updates. It avoids deflation and the optimization of a cost function using gradients.
It recursively updates the eigenvector and eigenvalue matrices simultaneously
with every new sample.


Robust RLS algorithm 

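The robust RLS algorithm [100] is more robust than PASTd. It can be implemented
in a sequential or parallel form. Given the $i$th neuron, $i = 1, \ldots, J_2$, the
sequential algorithm is given for all the patterns as [100]

$$\bar{\mathbf{w}}_i(t-1) = \frac{\mathbf{w}_i(t-1)}{\|\mathbf{w}_i(t-1)\|}, \tag{12.53}$$

$$y_i(t) = \bar{\mathbf{w}}_i^T(t-1)\mathbf{x}_t, \tag{12.54}$$

$$\hat{\mathbf{x}}_i(t) = \sum_{j=1}^{i-1} y_j(t)\bar{\mathbf{w}}_j(t-1), \tag{12.55}$$

$$\mathbf{w}_i(t) = \mu\mathbf{w}_i(t-1) + \left[ \mathbf{x}_t - \hat{\mathbf{x}}_i(t) \right] y_i(t), \tag{12.56}$$

$$\hat{\lambda}_i(t) = \frac{\|\mathbf{w}_i(t)\|}{t}, \tag{12.57}$$


where $y_i$ is the output of the $i$th hidden unit, and $\mathbf{w}_i(0)$ is initialized as a
small random value. By changing (12.55) into a recursive form, the robust RLS
algorithm becomes a local algorithm.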
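
The following NumPy sketch illustrates one possible reading of the recursion (12.53) to (12.57), in which each neuron uses its normalized weight vector to form the output and the reconstruction is accumulated recursively over the preceding neurons. That reading, the parallel ordering of the updates (all neurons are updated for each pattern), the function name `robust_rls`, and the default `mu=1.0` are assumptions made for illustration rather than the exact procedure of [100].

```python
import numpy as np


def robust_rls(X, n_components, mu=1.0, seed=1):
    """Sketch of a robust-RLS-style PCA recursion; X holds one sample per row.

    Returns the unnormalized weights W (columns w_i) and eigenvalue estimates.
    """
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], n_components))  # small random w_i(0)
    for x in X:
        W_bar = W / np.linalg.norm(W, axis=0)        # normalized weights, (12.53)
        y = W_bar.T @ x                              # outputs, (12.54)
        x_hat = np.zeros(X.shape[1])
        for i in range(n_components):
            # Here x_hat holds the sum over j < i of y_j * w_bar_j, (12.55).
            W[:, i] = mu * W[:, i] + (x - x_hat) * y[i]   # (12.56)
            x_hat = x_hat + y[i] * W_bar[:, i]            # recursive form of (12.55)
    lam = np.linalg.norm(W, axis=0) / X.shape[0]     # (12.57) at the final t
    return W, lam
```

In this reading the norm of each weight vector grows roughly linearly with the number of samples, which is why the eigenvalue estimate divides the norm by $t$.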

Robust RLS has the same flexibility as Kalman-type RLS [4], PASTd and 
APEX, in that increasing the number of neurons does not affect the previously 
extracted principal components. It naturally selects the inverse of the output 
energy to be the adaptive learning rate for the Hebbian rule. The Hebbian and 
Oja rules are closely related to the robust RLS algorithm by a suitable selection 
of the learning rates [100]. Robust RLS can also be derived from the adaptive 
learning algorithm [12] by using the first-order Taylor approximation [92]. 

Robust RLS is also robust to the error accumulation from the previous com- 
ponents, which exists in the sequential PCA algorithms like Kalman-type RLS 
and PASTd. Robust RLS converges rapidly, even if the eigenvalues extend over 
several orders of magnitude. According to the empirical results [100], robust RLS 
provides the best performance in terms of convergence speed as well as steady- 
state error, whereas Kalman-type RLS and PASTd have similar performance, 
which is inferior to that of robust RLS, and the adaptive learning algorithm [12] 
exhibits the poorest performance.


Other optimization-based PCA


PCA can be derived by many optimization methods based on a properly defined 
objective function. This leads to many other algorithms, including gradient- 
descent based algorithms [9, 79, 80, 106, 151], the CG method [37], and the 
quasi-Newton method [63, 103]. The gradient-descent method usually converges 
to a local minimum. Second-order algorithms such as the CG and quasi-Newton 
methods typically converge much faster than first-order methods, but have a 
computational complexity of $O(J_1^2 J_2)$ per iteration.

The infomax principle [79, 80] derives the principal subspace by maximizing
the mutual information criterion. Other examples of information criterion based
algorithms are the NIC algorithm [90] and coupled PCA [92].

The NIC algorithm [90] is obtained by applying the gradient-descent method 
to maximize the NIC, which is a cost function very similar to the mutual infor- 
mation criterion, but integrates a soft constraint on weight orthogonalization. 
The NIC has a steep landscape along the trajectory from a small weight matrix 
to the optimum one; it has a single global maximum, and all the other sta- 
tionary points are unstable saddle points. At the global maximum, W yields 
an arbitrary orthonormal basis of the principal subspace, and thus the NIC 
algorithm is a PSA method. It can extract the principal eigenvectors when the 
deflation technique is incorporated. The NIC algorithm converges much faster 
than SLA and LMSER, and is able to globally converge to the PSA solution from 
almost any weight initialization. Reorthonormalization can be applied so as to perform
true PCA [141, 90]. The NIC algorithm has a computational complexity of
$O(J_1^2 J_2)$ for each iteration. By selecting a well-defined adaptive learning rate,
the NIC algorithm also generalizes some well-known PSA/PCA algorithms such 
as PAST. For online implementation, an RLS version of the NIC algorithm has 
also been given in [90]. Weighted information criterion (WINC) [102] is obtained 
by adding a weight to the NIC to break its symmetry. The gradient-ascent based 
WINC algorithm can be viewed as an extended weighted SLA with an adaptive 
step size, leading to a much faster convergence speed. The RLS-based WINC 
algorithm provides fast convergence, high accuracy as well as low computational 
complexity. 

In PCA algorithms, the eigenmotion depends on the principal eigenvalue of 
the covariance matrix, while in MCA algorithms it depends on all the eigen- 
values [92]. Coupled learning rules can be derived by applying the Newton 
method to a common information criterion. In coupled PCA/MCA algorithms, 
both the eigenvalues and the eigenvectors are simultaneously adapted. The New- 
ton method yields averaged systems with identical speed of convergence in all 
eigendirections. The derived Newton-descent based PCA and MCA algorithms 
are respectively called nPCA and nMCA. The robust PCA algorithm [92], derived 
from nPCA, is shown to be closely related to robust RLS algorithm by applying 
the first-order Taylor approximation on the robust PCA. In order to extract multiple
principal components, one has to apply an orthonormalization procedure,
which is GSO, or its first-order approximation as in SLA, or deflation as in GHA.
In the coupled learning rules, multiple principal components are simultaneously
estimated by a coupled system of equations. In the coupled learning rules a first-order
approximation of GSO is superior to the standard deflation procedure in
terms of the orthonormality error and the quality of the eigenvectors and eigenvalues
generated [93]. An additional normalization step that enforces unit length
of the eigenvectors further improves the orthonormality of the weight vectors
[93].

Figure 12.7 The PCA network with hierarchical lateral connections.


12.5 Anti-Hebbian rule based PCA


When the update of a synaptic weight is proportional to the correlation of the 
pre- and postsynaptic activities, but the direction of the change is opposite to 
that in the Hebbian rule, we get an anti-Hebbian learning rule. The anti-Hebbian 
rule can be used to remove correlations between units receiving correlated inputs 
[35, 115, 116]; it is inherently stable [115, 116]. 

Anti-Hebbian rule based PCA algorithms can be derived by using a $J_1$-$J_2$
feedforward network with lateral connections among the output units [115, 116,
35]. The lateral connections can be in a symmetrical or hierarchical topology.
A hierarchical lateral connection topology is illustrated in Fig. 12.7, based on
which the Rubner-Tavan PCA algorithm [115, 116] and APEX [72] are proposed.
The lateral weight matrix $\mathbf{U}$ is an upper triangular matrix with the diagonal
elements being zero. In [35], the local PCA algorithm is based on a symmetrical
lateral connection topology. The feedforward weight matrix $\mathbf{W}$ is described in the
preceding sections, and the lateral weight matrix $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_{J_2}]$ is a $J_2 \times J_2$
matrix, where $\mathbf{u}_i = (u_{1i}, u_{2i}, \ldots, u_{J_2 i})^T$ includes all the lateral weights connected
to neuron $i$ and $u_{ji}$ denotes the lateral weight from neuron $j$ to neuron $i$.

The Rubner-Tavan PCA algorithm is based on the PCA network with hierarchical
lateral connection topology. The algorithm extracts the first $J_2$ principal
components in an order of decreasing eigenvalues. The output of the network is
given by [115, 116]


$$y_i = \mathbf{w}_i^T \mathbf{x} + \mathbf{u}_i^T \mathbf{y}, \quad i = 1, \ldots, J_2. \tag{12.58}$$

Notice that $u_{ji} = 0$ for $j \ge i$ and $\mathbf{U}$ is a $J_2 \times J_2$ upper triangular matrix.
The weights $\mathbf{w}_i$ are trained by Oja’s rule, while the lateral weights $\mathbf{u}_i$ are
updated by the anti-Hebbian rule


$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta_1(t) y_i(t) \left[ \mathbf{x}_t - \hat{\mathbf{x}}(t) \right], \tag{12.59}$$
$$\hat{\mathbf{x}} = \mathbf{W}\mathbf{y}, \tag{12.60}$$
$$\mathbf{u}_i(t+1) = \mathbf{u}_i(t) - \eta_2 y_i(t)\mathbf{y}(t). \tag{12.61}$$


This is a nonlocal algorithm. Typically, $\eta_1 = \eta_2 > 0$ is selected as a small number
or by the Robbins-Monro conditions. During the training process, the outputs
of the neurons are gradually uncorrelated and the lateral weights approach zero.
The network should be trained until the lateral weights $\mathbf{u}_i$ are below a specified
level. The PCA algorithm proposed in [35] has the same form as the Rubner-Tavan
PCA, but $\mathbf{U}$ is a full matrix.


APEX algorithm 


The APEX algorithm is used to recursively and adaptively extract the princi- 
pal components [72]. Given i — 1 principal components, it can produce the ith 
principal component iteratively. The hierarchical structure of lateral connections 
among the output units serves the purpose of weight orthogonalization. This 
structure also allows the network to grow or shrink without retraining the old 
units. The convergence analysis of APEX is based on stochastic approximation 
theory, and APEX is proved to have the property of exponential convergence. 
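Assuming that the correlation matrix $\mathbf{C}$ has distinct eigenvalues arranged
in decreasing order as $\lambda_1 > \lambda_2 > \cdots > \lambda_{J_2}$ with the corresponding eigenvectors
$\mathbf{w}_1, \ldots, \mathbf{w}_{J_2}$, the algorithm is given as [72, 73]

$$\mathbf{y} = \mathbf{W}^T \mathbf{x}, \tag{12.62}$$
$$y_i = \mathbf{w}_i^T \mathbf{x} + \mathbf{u}^T \mathbf{y}, \tag{12.63}$$

where $\mathbf{y} = (y_1, \ldots, y_{i-1})^T$ is the output vector, $\mathbf{u} = (u_{1i}, u_{2i}, \ldots, u_{(i-1)i})^T$, and
$\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_{i-1}]$ is the weight matrix of the first $i - 1$ neurons. These definitions
are for the first $i$ neurons, which are different from their respective definitions
given in the preceding sections. The iteration is given by [72, 73]


$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta_i(t) \left[ y_i(t)\mathbf{x}_t - y_i^2(t)\mathbf{w}_i(t) \right], \tag{12.64}$$


$$\mathbf{u}(t+1) = \mathbf{u}(t) - \eta_i(t) \left[ y_i(t)\mathbf{y}(t) + y_i^2(t)\mathbf{u}(t) \right]. \tag{12.65}$$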


Equations (12.64) and (12.65) are respectively the Hebbian and anti-Hebbian
parts of the algorithm. $y_i$ tends to be orthogonal to all the previous components
due to the anti-Hebbian rule, also called the orthogonalization rule.

APEX can also be derived from the RLS method using the MSE criterion.
Based on the RLS method, the optimum learning rate in terms of convergence
speed is given by [73]

$$\eta_i(t) = \frac{1}{\sum_{k=1}^{t} \mu^{t-k} y_i^2(k)} = \frac{\eta_i(t-1)}{\mu + y_i^2(t)\eta_i(t-1)}, \tag{12.66}$$

where $0 < \mu < 1$ is a forgetting factor, which induces an effective time window
of size $M = \frac{1}{1-\mu}$. The optimal learning rate can also be written as [72]

$$\eta_i(t) = \frac{1}{M\sigma_i^2}, \tag{12.67}$$

where $\sigma_i^2 = E[y_i^2(t)]$ is the average output power or variance of neuron $i$. According
to [72], $\sigma_i^2(t) \to \lambda_i$, as $t \to \infty$. A practical value of $\eta_i$ is selected by

$$\eta_i(t) = \frac{1}{M\lambda_{i-1}}, \tag{12.68}$$

since $\lambda_{i-1} > \lambda_i$ and $\lambda_i$ is not easy to get.

Both sequential and parallel APEX algorithms are given in [73]. In parallel 
APEX, all the $J_2$ output neurons work simultaneously. In sequential APEX, the
output neurons are added one by one. Sequential APEX is more attractive in 
practical applications, since one can decide a desirable number of neurons during 
the learning process. APEX is especially useful when the number of required 
principal components is not known a priori. When the environment changes 
over time, a new principal component can be added to compensate for the change 
without affecting previously computed principal components. Thus, the network 
structure can be expanded if necessary. 

The stopping criterion can be that for each $i$ the changes in $\mathbf{w}_i$ and $\mathbf{u}$ are
below a threshold. At this time, $\mathbf{w}_i$ converges to the eigenvector of the correlation
matrix $\mathbf{C}$ corresponding to the $i$th largest eigenvalue, and $\mathbf{u}$ converges to zero.
The stopping criterion can also be the change of the average output variance
$\sigma_i^2(t)$ being sufficiently small.
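
Sequential APEX can be sketched in a few lines: each new neuron is trained with the Hebbian rule for its feedforward weights and the anti-Hebbian rule for its lateral weights, while the earlier neurons stay frozen. In the sketch below the function name `apex`, the fixed number of epochs used in place of a stopping criterion, and the simple decaying step size used in place of the RLS-based learning rates are assumptions made for illustration.

```python
import numpy as np


def apex(X, n_components, eta0=0.01, t_decay=2000.0, n_epochs=5, seed=1):
    """Sequential APEX, (12.62)-(12.65); X holds one zero-mean sample per row.

    Returns W (columns w_i) and the list of final lateral weight vectors u.
    """
    rng = np.random.default_rng(seed)
    J1 = X.shape[1]
    W = np.zeros((J1, 0))                      # weights of the neurons trained so far
    lateral = []
    for i in range(n_components):
        w = 0.1 * rng.standard_normal(J1)      # feedforward weights of the new neuron
        u = np.zeros(i)                        # lateral weights from the earlier neurons
        t = 0
        for _ in range(n_epochs):
            for x in X:
                eta = eta0 / (1.0 + t / t_decay)           # assumed schedule
                y_prev = W.T @ x                           # (12.62)
                y_i = w @ x + u @ y_prev                   # (12.63)
                w += eta * (y_i * x - y_i ** 2 * w)        # Hebbian part, (12.64)
                u -= eta * (y_i * y_prev + y_i ** 2 * u)   # anti-Hebbian part, (12.65)
                t += 1
        W = np.column_stack([W, w])            # freeze the trained neuron
        lateral.append(u)
    return W, lateral
```

Because the earlier columns of W are never modified, a further component can be added by running the outer loop once more, which reflects the growing-network property described above.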

Most existing linear complexity methods including GHA, SLA, and PCA with
the lateral connections require a computational complexity of $O(J_1 J_2)$ per iteration.
For recursive computation of each additional principal component, APEX
requires $O(J_1)$ operations per iteration, while GHA utilizes $O(J_1 J_2)$ per iteration.

In contrast to the heuristic derivation of APEX, a class of learning algorithms,
called the $\psi$-APEX, is presented based on criterion optimization [34]. $\psi$ can be
selected as any function that guarantees the stability of the network. Some members
in the class have better numerical performance and require less computational
effort compared to that of both GHA and APEX.

Figure 12.8 Sample images for face recognition.


Example 12.3: Face recognition using PCA. PCA is a classical approach to 
face recognition. Each sample is a face image, normalized to the same size. In 
this example, we randomly select 60 samples from 30 persons, 2 samples for each 
person, for training. The testing set includes 30 samples, one for each person. 
The samples are of size 200 x 180 pixels, and they are reshaped in vector form. 
These samples are excerpted from Spacek’s Faces94 collection
(http://cswww.essex.ac.uk/mv/allfaces/faces94.html). Some of the face samples are shown
in Fig. 12.8.

After applying PCA on the training set, the corresponding weights called eigenfaces
are obtained. When a new sample is presented, its projection on the weights
is derived. The projection of the presented sample is compared with those of all
the training samples, and the test sample is assigned the class of the training
sample whose projection differs least from its own. We test the trained PCA
method using all the 30 testing samples, and the classification rate for this example
is 100%.
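
The projection-and-compare procedure of this example can be sketched as follows. The code uses a batch SVD instead of a trained PCA network, subtracts the training mean before projecting, and measures the difference between projections by the Euclidean distance; these choices and the function names `fit_eigenfaces` and `classify` are assumptions for illustration, and no real face data is involved.

```python
import numpy as np


def fit_eigenfaces(train, n_components):
    """Return the training mean and the leading principal directions (eigenfaces).

    train holds one vectorized image per row.
    """
    mean = train.mean(axis=0)
    # Rows of Vt are the principal directions of the centered training set.
    _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def classify(sample, mean, eigenfaces, train, labels):
    """Assign the label of the training sample with the closest projection."""
    train_proj = (train - mean) @ eigenfaces
    proj = (sample - mean) @ eigenfaces
    nearest = np.argmin(np.linalg.norm(train_proj - proj, axis=1))
    return labels[nearest]
```

In use, `train` would hold the vectorized training images and `labels` the person index of each row; every test image is then passed to `classify` and the predicted labels are compared with the true ones to obtain the classification rate.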


Image compression using PCA 

Image compression is performed to remove the redundancy in an image for storage
and/or transmission purposes. Image compression is usually implemented
by partitioning an image into many nonoverlapping 8 × 8 pixel blocks and then
compressing them one by one. For example, if we compress each 64-pixel
patch into 8 values, we achieve a compression ratio of 1 : 8. This work can be
performed by using a PCA network. Based on the statistics of all the regions,
one can use PCA to compress the image. Each region is concatenated into a
vector, and all the vectors constitute a training set. PCA is then applied to
extract the prominent principal components, and in this way the image is compressed.
Sanger used GHA for image compression [119]. Similar results using a three-layer
autoassociative network with BP learning have been reported in [44].

Figure 12.9 Network weights after training with the Lena image.


PCA as well as LDA achieves the same results for an original data set and its
orthonormally transformed version [16]. Thus, PCA and LDA can be directly
implemented in the DCT domain, and the results are exactly the same as those
obtained in the spatial domain. For images compressed using DCT, such as in
the JPEG or MPEG standard, PCA and LDA can be directly implemented in the
DCT domain so that the inverse DCT can be avoided and the computation
conducted on data of reduced dimension.
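
The invariance under an orthonormal transform can be checked numerically. In the
sketch below, a random orthonormal matrix T stands in for the DCT (an
assumption made to keep the example dependency-free), and the data are
synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated synthetic data, one sample per row.
X = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))
# Random orthonormal transform standing in for the DCT.
T, _ = np.linalg.qr(rng.standard_normal((6, 6)))

def pca(data, k):
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(data)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1][:k]
    return evals[order], evecs[:, order], centered

evals_x, W_x, Xc = pca(X, 2)
evals_z, W_z, Zc = pca(X @ T.T, 2)       # transformed samples z = T x

scores_x = Xc @ W_x                       # principal components in the original domain
scores_z = Zc @ W_z                       # principal components in the transformed domain
```

The eigenvalues coincide and the principal components agree up to the arbitrary
sign of each eigenvector, which is why the comparison is made on absolute
values.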


Example 12.4: We use the Lena image of 512 x 512 pixels with 256 gray levels
for training the PCA network. A linear 64-8 PCA network is used to learn
the image. By 8 x 8 partitioning, we get 64 x 64 = 4,096 samples. Each of the 
output nodes is connected by 64 weights, denoted by an 8 x 8 mask. The training 
results for the 8 codewords are illustrated in Fig. 12.9, where the positive weights 
are shown as white, the negative weights as black, and zero weights as gray. 

After the learning algorithm converges, the PCA network can be used to code 
the image. An 8 x 8 block is multiplied by each of the eight weight masks, and 
this yields 8 coefficients. The reconstruction of the image from the coefficients 
can be conducted by multiplying the weights by those coefficients, and combining 
the reconstructed blocks into an image. The reconstructed image, illustrated in
Fig. 12.10 and obtained without any quantization of the coefficients, looks as
good as the original to the human eye.

We now use the trained network to encode the family picture of 512 x 560, 
and the result is shown in Fig. 12.11. The reconstructed image is of good quality 
to the human eye. 
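
The block-wise coding and reconstruction steps can be sketched as follows. This
is an illustration under assumptions: a small synthetic image replaces the Lena
image, the weight masks are obtained by a batch SVD rather than by training a
PCA network, and the helper names to_blocks, from_blocks and codec are
hypothetical.

```python
import numpy as np

def to_blocks(img, b=8):
    # Split an image into nonoverlapping b x b blocks, one flattened block per row.
    h, w = img.shape
    return img.reshape(h // b, b, w // b, b).swapaxes(1, 2).reshape(-1, b * b)

def from_blocks(blocks, shape, b=8):
    # Inverse of to_blocks.
    h, w = shape
    return blocks.reshape(h // b, w // b, b, b).swapaxes(1, 2).reshape(h, w)

# Synthetic smooth test image (a stand-in for a real picture).
yy, xx = np.mgrid[0:64, 0:64]
img = np.sin(xx / 7.0) + np.cos(yy / 5.0) + 0.1 * np.sin(xx * yy / 200.0)

blocks = to_blocks(img)                   # 64 blocks of 64 pixels each
# Principal directions of the (uncentered) block ensemble; each row of Vt
# plays the role of one 8 x 8 weight mask.
_, _, Vt = np.linalg.svd(blocks, full_matrices=False)

def codec(k):
    masks = Vt[:k]                        # k masks of 64 weights
    coeffs = blocks @ masks.T             # encode: k coefficients per block
    recon = from_blocks(coeffs @ masks, img.shape)   # decode and reassemble
    return coeffs, recon

coeffs8, recon8 = codec(8)
_, recon1 = codec(1)
mse8 = np.mean((img - recon8) ** 2)
mse1 = np.mean((img - recon1) ** 2)
```

Keeping 8 of the 64 directions gives 8 coefficients per block, and the
reconstruction error cannot exceed that obtained with fewer masks.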

In the above two examples, an 8 x 8 block is encoded by only 8 coefficients.
Taking into account the 64 x 8 codebook for the entire image, the coarse
compression ratio is close to 8 : 1 if the coefficients are quantized into 8
bits. Using entropy considerations, each of the 8 coefficients can be further
uniformly quantized using fewer bits, with the number of bits proportional to
the logarithm of the variance of that coefficient over the whole image. For
example, in [119], for one image the first two coefficients require 5 bits each,
the third coefficient requires 3 bits, and the remaining five coefficients
require 2 bits each, so that a total of 23 bits is used to code each 8 x 8
block, that is, a bit rate of 23/64 = 0.36 bits per pixel. This achieves a
compression ratio of $\frac{64 \times 8}{23} = 22.26$ to 1.
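
The arithmetic behind these figures can be reproduced in a few lines, assuming
8 bits per pixel in the original image and ignoring the cost of storing the
codebook.

```python
bits_per_coefficient = [5, 5, 3, 2, 2, 2, 2, 2]   # allocation quoted from [119]
pixels_per_block = 8 * 8

bits_per_block = sum(bits_per_coefficient)
bit_rate = bits_per_block / pixels_per_block                 # bits per pixel
compression_ratio = pixels_per_block * 8 / bits_per_block    # 8-bit original pixels
```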
Figure 12.10 The Lena picture (a) and its restored version (b).

Figure 12.11 The original picture (a) and its reconstructed version (b).


12.6 Nonlinear PCA 


For non-Gaussian data distributions, PCA is not able to capture complex non- 
linear correlations, and nonlinear processing of the data is usually more effi- 
cient. Nonlinearities introduce higher-order statistics into the computation in an 
implicit way. Higher-order statistics, defined by cumulants or higher-than-second 
moments, are needed for good characterization of non-Gaussian data. For
non-Gaussian input data, nonlinear PCA permits extraction of higher-order
components and provides a sufficient representation. Nonlinear PCA networks and
learning algorithms can be classified into symmetric and hierarchical ones sim- 
ilar to those for PCA networks. After training, the lateral connections between 
output units are not needed, and the network becomes purely feedforward. 

Several popular PCA algorithms have been generalized into robust versions by
applying a statistical-physics approach [140], where the defined objective
function can be regarded as a soft generalization of the M-estimator. Robust PCA
algorithms are defined so that the optimization criterion grows less than
quadratically and the constraint conditions are the same as for PCA algorithms
[65]. To derive robust PCA algorithms, the variance maximization criterion is
generalized as $E[\sigma(\mathbf{w}_i^T \mathbf{x})]$ for the $i$th neuron,
subject to hierarchical or symmetric orthonormality constraints, where
$\sigma(\cdot)$ is the M-estimator, assumed to be a valid differentiable cost
function that grows less than quadratically, at least for large values of $x$.
Examples of such functions are $\sigma(x) = \ln\cosh(x)$ and $\sigma(x) = |x|$.
Robust/nonlinear PCA can be obtained by minimizing the MSE that introduces
nonlinearity using the gradient-descent procedure [65, 137]. Robust/nonlinear
PCA algorithms have better stability properties than the corresponding PCA
algorithms if the (odd) nonlinearity $\varphi(x)$ grows less than linearly,
namely, $|\varphi(x)| < |x|$ [65].

In SOM, lateral inhibitory connections for output neurons are usually used to
induce WTA competition among all the output neurons. SOM is capable of
performing dimension reduction on the input. It is inherently nonlinear, and is
viewed as a nonlinear PCA [113]. ASSOM can be treated as a hybrid of vector
quantization and PCA.

Principal curves [45] are nonlinear generalizations of the notion of the first 
principal component of PCA. A principal curve is a parameterized curve passing 
through the “middle” of a data cloud. 


Autoassociative network-based nonlinear PCA 


The MLP can be used to perform nonlinear dimension reduction and hence
nonlinear PCA. Both the input and output layers of the MLP have $J_1$ units,
and one of its hidden layers, known as the bottleneck or representation layer,
has $J_2$ units, $J_2 < J_1$. The network is trained to reproduce its input
vectors. This kind of network is called the autoassociative MLP. After the
network is trained, it performs a projection onto the $J_2$-dimensional subspace
spanned by the first $J_2$ principal components of the data. The vectors of
weights leading to the hidden units form a basis set that spans the principal
subspace, and data compression therefore occurs in the bottleneck layer. Many
applications of the MLP in autoassociative mode for PCA are available in the
literature [6, 69]. The three-layer autoassociative $J_1$-$J_2$-$J_1$
feedforward network or MLP network can also be used to extract the first $J_2$
principal components of $J_1$-dimensional data. If nonlinear activation
functions are applied in the hidden layer, the network performs as a nonlinear
PCA network. In the case of nonlinear units, local minima certainly appear.
However, if linear units are used in the output layer, nonlinearity in the
hidden layer is theoretically meaningless [6]. This is due to the fact that the
network tries to approximate a linear mapping.

Figure 12.12 Architecture of Kramer’s nonlinear PCA network.

Kramer’s nonlinear PCA network [69] is a five-layer autoassociative MLP,
whose architecture is illustrated in Fig. 12.12. It has $J_1$ input and $J_1$
output nodes. The third layer has $J_2$ nodes. $y_i$, $i = 1, \ldots, J_2$, is
the $i$th output of the bottleneck layer. Nonlinear activation functions such as
the sigmoidal functions are used in the second and fourth layers, while the
nodes in the bottleneck and output layers usually have linear activation
functions though they can be nonlinear. The network is trained by BP. Kramer’s
nonlinear PCA fits a lower-dimensional surface through the training data.

A three-layer MLP can approximate arbitrarily well any continuous function. 
The input, second and bottleneck layers constitute a three-layer MLP, which 
projects the training data onto the surface giving principal components. Like- 
wise, the combination of the bottleneck, fourth and output layers defines the 
surface that inversely maps the principal components into the training data. The 
outputs of the network are trained to approximate the inputs. After the network 
is trained, the nodes in the bottleneck layer give a lower-dimensional represen- 
tation of the inputs. Usually, data compression achieved in the bottleneck layer 
in such networks is somewhat better than that provided by the respective PCA 
solution [62]. This is actually a nonlinear PCA network. However, BP is prone 
to local minima and often requires excessive time for convergence. 
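
A five-layer autoassociative network of this kind can be sketched in a few
lines. The sketch is illustrative only: the toy data, the layer sizes 2-8-1-8-2,
the tanh activation, the learning rate and the plain batch gradient descent are
all assumptions, and no claim is made that it reproduces the training setup of
[69].

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=(200, 1))
X = np.hstack([t, t ** 2])                 # toy data lying on a 1-D curve in 2-D

sizes = [2, 8, 1, 8, 2]                    # J1, mapping layer, bottleneck J2, demapping layer, J1
nonlinear = [True, False, True, False]     # tanh in layers 2 and 4, linear bottleneck and output
Ws = [0.5 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(X):
    acts = [X]
    for W, b, nl in zip(Ws, bs, nonlinear):
        z = acts[-1] @ W + b
        acts.append(np.tanh(z) if nl else z)
    return acts                            # acts[2] holds the bottleneck outputs

def mse(X):
    return np.mean((forward(X)[-1] - X) ** 2)

initial_loss = mse(X)
lr = 0.05
for _ in range(3000):
    acts = forward(X)
    delta = 2.0 * (acts[-1] - X) / X.size  # gradient at the linear output layer
    for k in reversed(range(len(Ws))):
        grad_W = acts[k].T @ delta
        grad_b = delta.sum(axis=0)
        if k > 0:
            back = delta @ Ws[k].T         # back-propagate before Ws[k] is updated
            delta = back * (1.0 - acts[k] ** 2) if nonlinear[k - 1] else back
        Ws[k] -= lr * grad_W
        bs[k] -= lr * grad_b

final_loss = mse(X)
codes = forward(X)[2]                      # one nonlinear principal component per sample
```

The first two weight layers implement the mapping onto the bottleneck, and the
last two implement the inverse mapping back to the input space.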

With very noisy data, having plentiful samples eliminates overfitting in nonlin- 
ear regression, but not in nonlinear PCA. To overcome this problem in Kramer’s 
nonlinear PCA, an information criterion is proposed for selecting the best model 
among multiple models with different complexity and regularization [54]. 

A hierarchical nonlinear PCA network composed of a number of independent 
subnetworks can extract ordered nonlinear principal components [118]. Each sub- 
network extracts one principal component, and can be selected as Kramer’s non- 
linear PCA network. The subnetworks are hierarchically arranged and trained. 

In contrast to autoassociative networks, the output pattern in heteroassociative
networks is not the same as the input pattern for each training pair.
Heteroassociative networks develop arbitrary internal representations in the
hidden layers to associate inputs to class identifiers, usually in the context
of pattern classification. The rate of dimension reduction is not as high as
that of autoassociative networks.


Minor component analysis 


In contrast to PCA, MCA finds the smallest eigenvalues and their corresponding
eigenvectors of the autocorrelation matrix C of the signals. MCA is closely
associated with curve and surface fitting under the TLS criterion [138]. MCA
provides an alternative solution to the TLS problem [39]. The TLS technique
achieves a better optimal objective than the LS technique [40]. Both the
solutions to the TLS and LS problems can be obtained by SVD. However, the TLS
technique is computationally much more expensive than the LS technique. MCA is
useful in many fields including spectrum estimation, optimization, TLS
parameter estimation in adaptive signal processing, and eigen-based bearing
estimation.
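
The link between MCA and TLS fitting can be illustrated on a line fit in the
plane: the minor eigenvector of the data covariance is the normal of the TLS
line. The data, noise level and variable names below are assumptions, and the
data are centered before the covariance is formed.

```python
import numpy as np

rng = np.random.default_rng(2)
direction = np.array([2.0, 1.0]) / np.sqrt(5.0)     # true line direction (assumed)
s = rng.uniform(-3.0, 3.0, size=300)
points = np.outer(s, direction) + 0.05 * rng.standard_normal((300, 2))

centroid = points.mean(axis=0)
centered = points - centroid
C = centered.T @ centered / len(points)

evals, evecs = np.linalg.eigh(C)       # eigenvalues in ascending order
normal = evecs[:, 0]                   # minor component: normal of the TLS line
offset = normal @ centroid             # fitted line: normal . p = offset
```

Here the batch eigendecomposition replaces the adaptive MCA algorithms discussed
in this section; an online rule would estimate the same vector sample by sample.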

Minor components can be extracted in ways similar to those for principal
components. A simple idea is to reverse the sign of the PCA algorithms, since in
many algorithms principal and minor components correspond to the maximum 
and minimum of a cost function, respectively. However, this idea does not work 
in general [96]. 


Extracting the first minor component 


The anti-Hebbian learning rule and its normalized version can be used for MCA 
[139]. The anti-Hebbian algorithm tends rapidly to infinite magnitudes of the 
weights. The normalized anti-Hebbian algorithm leads to better convergence, but 
it may also lead to infinite magnitudes of weights before the algorithm converges. 
To avoid this, one can renormalize the weight vector at each iteration. The 
constrained anti-Hebbian learning algorithm [38, 39] has a simple structure, and 
requires a low computational complexity per update. It can be used to solve the 
TLS parameter estimation [39], and has been extended for complex-valued TLS 
problem [38]. However, as in the anti-Hebbian algorithm, the convergence of the 
magnitudes of the weights cannot be guaranteed unless the initial weights take 
special values. 

The total least mean squares (TLMS) algorithm [30] is a random adaptive
algorithm for extracting the minor component, which has an equilibrium point
under persistent excitation conditions. The TLMS algorithm requires about
$4J_1$ multiplications per iteration, which is twice the complexity of the LMS
algorithm. An adaptive step-size learning algorithm [99] is derived for
extracting the minor component by introducing an information criterion. The
algorithm globally converges asymptotically to a stable equilibrium point,
which corresponds to the minor component and its corresponding eigenvector. The
algorithm outperforms TLMS in terms of both convergence speed and estimation
accuracy.

In [107, 108], learning algorithms for estimating the minor component from input
signals are proposed, and their dynamics are analyzed via a deterministic
discrete-time method. Some sufficient conditions are obtained to guarantee
convergence of the algorithm. A convergent algorithm is given by [108]

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta \left[ \mathbf{w}^T(t)\mathbf{w}(t)\, y(t)\, \mathbf{x}(t) - \frac{y^2(t)\, \mathbf{w}(t)}{\mathbf{w}^T(t)\mathbf{w}(t)} \right], \tag{12.69}$$

where $\eta < 0.5/\lambda_1$, $\lambda_1 > \lambda_2 > \cdots > \lambda_n > 0$
are all the eigenvalues of $\mathbf{C}$, $\|\mathbf{w}(0)\|^2 < 0.5/\eta$, and
$\mathbf{w}^T(0)\mathbf{v}_n \neq 0$, $\mathbf{v}_n$ being the eigenvector
associated with the smallest eigenvalue of $\mathbf{C}$. $\mathbf{w}(t)$ will
converge to the minor component of the input data.

A class of self-stabilizing MCA algorithms, which is proved to be globally
asymptotically convergent using Lyapunov’s theorem, is given by [148]

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta(t)\, y(t) \left[ \mathbf{x}(t) - \frac{y(t)\, \mathbf{w}(t)}{\|\mathbf{w}(t)\|^{\alpha+2}} \right] \tag{12.70}$$

for integer $\alpha \geq 0$. It reduces to normalized Oja for $\alpha = 0$
[96, 139], to the one given in [147] for $\alpha = 1$, and to a modified Oja’s
algorithm [33] for $\alpha = 2$. To improve the convergence speed, select a
large $\alpha$ if $\|\mathbf{w}(t)\| > 1$, and set $\alpha = 1$ if
$\|\mathbf{w}(t)\| < 1$.
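
The behavior of the rule from [108] can be explored in its averaged
(deterministic discrete-time) form, in which $y(t)\mathbf{x}(t)$ is replaced by
$\mathbf{C}\mathbf{w}$ and $y^2(t)$ by $\mathbf{w}^T\mathbf{C}\mathbf{w}$. The
sketch below assumes the update reads
$\mathbf{w} \leftarrow \mathbf{w} - \eta\,[\|\mathbf{w}\|^2 \mathbf{C}\mathbf{w} - (\mathbf{w}^T\mathbf{C}\mathbf{w})\,\mathbf{w}/\|\mathbf{w}\|^2]$;
the matrix, step size and initial vector are hypothetical choices that satisfy
the stated conditions.

```python
import numpy as np

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
lam = np.array([3.0, 2.0, 1.5, 0.5])       # eigenvalues of C (hypothetical)
C = U @ np.diag(lam) @ U.T
v_min = U[:, -1]                           # eigenvector of the smallest eigenvalue

eta = 0.1                                  # satisfies eta < 0.5 / lambda_1
w = 0.3 * rng.standard_normal(4)           # small start, so ||w(0)||^2 < 0.5 / eta
conditions_ok = (eta < 0.5 / lam.max()) and (w @ w < 0.5 / eta)

for _ in range(2000):
    n2 = w @ w
    # Averaged update: y x -> C w and y^2 -> w^T C w.
    w = w - eta * (n2 * (C @ w) - (w @ C @ w) * w / n2)

alignment = abs(w @ v_min) / np.linalg.norm(w)
final_norm = np.linalg.norm(w)
```

In this averaged setting the weight vector aligns with the minor eigenvector
and its norm settles at unity; the stochastic rule driven by individual samples
fluctuates around the same point.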


Self-stabilizing minor component analysis 


A general algorithm that can extract, in parallel, principal and minor eigenvec- 
tors of arbitrary dimensions is derived based on the natural-gradient method in 
[14]. The difference between PCA and MCA lies in the sign of the learning rate. 
The MCA algorithm can be written as [14] 


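$$\mathbf{W}(t+1) = \mathbf{W}(t) - \eta \left[ \mathbf{x}(t)\mathbf{y}^T(t)\mathbf{W}^T(t)\mathbf{W}(t) - \mathbf{W}(t)\mathbf{y}(t)\mathbf{y}^T(t) \right]. \tag{12.71}$$

At initialization, $\mathbf{W}^T(0)\mathbf{W}(0)$ is required to be diagonal.
The algorithm suffers from a marginal instability, and thus it requires
intermittent normalization such that $\|\mathbf{w}_i\| = 1$ [26].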

A self-stabilizing MCA algorithm is given in [26] as 


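$$\mathbf{W}(t+1) = \mathbf{W}(t) - \eta \left[ \mathbf{x}(t)\mathbf{y}^T(t)\mathbf{W}^T(t)\mathbf{W}(t)\mathbf{W}^T(t)\mathbf{W}(t) - \mathbf{W}(t)\mathbf{y}(t)\mathbf{y}^T(t) \right]. \tag{12.72}$$

Algorithm (12.72) is self-stabilizing, such that none of
$\|\mathbf{w}_i(t)\|$ deviates significantly from unity. It diverges for PCA
when $-\eta$ is changed to $+\eta$.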
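
The self-stabilizing property can be observed numerically on the averaged form
of the rule from [26]. The sketch assumes the update
$\mathbf{W} \leftarrow \mathbf{W} - \eta\,[\mathbf{C}\mathbf{W}(\mathbf{W}^T\mathbf{W})^2 - \mathbf{W}\mathbf{W}^T\mathbf{C}\mathbf{W}]$,
obtained by replacing $\mathbf{x}\mathbf{y}^T$ with $\mathbf{C}\mathbf{W}$ and
$\mathbf{y}\mathbf{y}^T$ with $\mathbf{W}^T\mathbf{C}\mathbf{W}$ for
$\mathbf{y} = \mathbf{W}^T\mathbf{x}$; the matrix, step size and starting point
are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
lam = np.array([5.0, 4.0, 3.0, 1.0, 0.5])  # eigenvalues of C (hypothetical)
C = U @ np.diag(lam) @ U.T
minor = U[:, -2:]                          # true two-dimensional minor subspace

W, _ = np.linalg.qr(rng.standard_normal((5, 2)))
W = 0.7 * W                                # columns deliberately not of unit norm
eta = 0.02

for _ in range(5000):
    G = W.T @ W
    # Averages used: E[x y^T] = C W and E[y y^T] = W^T C W.
    W = W - eta * (C @ W @ G @ G - W @ (W.T @ C @ W))

gram_error = np.linalg.norm(W.T @ W - np.eye(2))
captured = np.linalg.norm(minor.T @ W) ** 2 / np.linalg.norm(W) ** 2
```

No explicit renormalization is applied in the loop, yet the columns end up
orthonormal and span the minor subspace.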

A class of self-stabilizing MCA algorithms is obtained, and its convergence
and stability are analyzed via a deterministic discrete-time method [68]. Some
sufficient conditions are obtained to guarantee the convergence of these
learning algorithms. These self-stabilizing algorithms can efficiently extract
the minor component, and they outperform several self-stabilizing MCA
algorithms, a modified Oja algorithm [33], and algorithm (12.69) [108].


Oja-based MCA 


Oja’s minor subspace analysis (MSA) algorithm can be formulated by reversing 
the sign of the learning rate of SLA for PSA [97]. However, Oja’s MSA algorithm 
is known to diverge [96, 26, 2]. 

The orthogonal Oja algorithm consists of Oja’s MSA plus an orthogonalization
of $\mathbf{W}(t)$ at each iteration [1], such that

$$\mathbf{W}^T(t)\mathbf{W}(t) = \mathbf{I}. \tag{12.73}$$

In this case, Oja’s MSA [97], (12.71) [14], and (12.72) [26] are equivalent. A
Householder transform based implementation of the MCA algorithm is given in
[1]. Orthogonal Oja is numerically very stable. By reversing the sign of
$\eta$, we extract $J_2$ principal components.

Normalized Oja [2] is derived by optimizing the MSE subject to an approxi- 
mation to the orthonormal constraint (12.73). This leads to the optimal learning 
rate. Normalized orthogonal Oja is an orthogonal version of normalized Oja such 
that (12.73) is perfectly satisfied [2]. Both algorithms offer, as compared to SLA, 
faster convergence, orthogonality, and better numerical stability with a slight 
increase in computational complexity. By switching the sign of $\eta$ in the
given learning algorithms, both normalized Oja and normalized orthogonal Oja can
be used for the estimation of minor and principal subspaces of a vector sequence.

Oja’s MSA, (12.71), (12.72), orthogonal Oja, normalized Oja, and normalized
orthogonal Oja all have a complexity of $O(J_1 J_2)$ [1, 26]. Orthogonal Oja,
normalized Oja, and normalized orthogonal Oja require a lower computational load
than algorithms (12.71) and (12.72) [1, 2].


Other algorithms 


In [15], the proposed MCA algorithm for extracting multiple minor components 
utilizes the idea of sequential addition, and a conversion method between MCA 
and PCA is also discussed. 

Based on a generalized differential equation for the generalized eigenvalue 
problem, a class of algorithms can be obtained for extracting the first princi- 
pal or minor component by selecting different parameters and functions [154]. 
Many PCA algorithms [95, 151, 132] and MCA algorithms [132] are special cases 
of this class. All the algorithms of this class have the same order of convergence 
speed and are robust to implementation error. 

A rapidly convergent quasi-Newton method is applied to extract multiple
minor components in [89]. The algorithm has a complexity of $O(J_2 J_1^2)$ but
with quadratic convergence. The algorithm makes use of the implicit
orthogonalization procedure that is built into it through an inflation
technique.

The adaptive minor component extraction (AMEX) algorithm [101] extends
the work in [99] to extract multiple minor components corresponding to dis- 
tinct eigenvalues. AMEX comes from the TLMS algorithm [30]. Unlike TLMS, 
AMEX is developed by unconstrained minimization of an information criterion 
using the gradient search approach. The criterion has a unique global minimum 
at the minor subspace and all other equilibrium points are saddle points. The 
algorithm automatically performs multiple minor component extraction in par- 
allel without an inflation procedure. AMEX has the merit that increasing the 
number of the desired minor components does not affect the previously extracted 
minor components. 

Several minor component algorithms, including (12.70), normalized Oja [96, 
139], and an algorithm given in [101], are extended to those for tracking multiple 
minor components or the minor subspace in [33]. 

In [149, 81], simple neural network models, described by differential equations, 
calculate the largest and smallest eigenvalues as well as their corresponding eigen- 
vectors of any real symmetric matrix. 


Constrained PCA 


When certain subspaces are less preferred than others, this yields constrained
PCA [71]. The optimality criterion for constrained PCA is variance
maximization, as in PCA, but with an external subspace orthogonality constraint
that the extracted principal components are orthogonal to some undesired
subspace.

Given a $J_1$-dimensional stationary stochastic input vector $\mathbf{x}_t$ and
an $l$-dimensional ($l < J_1$) constraint vector $\mathbf{q}(t)$ such that

$$\mathbf{q}(t) = \mathbf{Q}^T \mathbf{x}_t, \tag{12.74}$$

where $\mathbf{Q}$ is an orthonormal constraint matrix spanning an undesirable
subspace $\mathcal{L}$, the task is to find, in the principal component sense,
the most representative $J_2$-dimensional subspace $\mathcal{L}_{J_2}$ that is
constrained to be orthogonal to $\mathcal{L}$, where $l + J_2 < J_1$. That is,
we are required to find the optimal linear transform

$$\mathbf{y}(t) = \mathbf{W}^T \mathbf{x}_t, \tag{12.75}$$

where $\mathbf{W}$ is orthonormal, such that

$$E_{\mathrm{CPCA}} = E\left[\|\mathbf{x}_t - \hat{\mathbf{x}}_t\|^2\right] = E\left[\|\mathbf{x}_t - \mathbf{W}\mathbf{y}(t)\|^2\right] \tag{12.76}$$

is minimized subject to

$$\mathbf{Q}^T \mathbf{W} = \mathbf{0}. \tag{12.77}$$

The optimal solution to the constrained PCA problem is given by [71, 73]

$$\mathbf{W}^* = [\tilde{\mathbf{c}}_1 \, \ldots \, \tilde{\mathbf{c}}_{J_2}], \tag{12.78}$$

where $\tilde{\mathbf{c}}_i$, $i = 1, \ldots, J_2$, are the principal
eigenvectors of the skewed autocorrelation matrix

$$\mathbf{C}_s = (\mathbf{I} - \mathbf{Q}\mathbf{Q}^T)\,\mathbf{C}. \tag{12.79}$$

At the optimum, $E_{\mathrm{CPCA}}$ takes its minimum

$$E^*_{\mathrm{CPCA}} = \sum_{i=J_2+1}^{J_1} \tilde{\lambda}_i, \tag{12.80}$$

where $\tilde{\lambda}_i$, $i = 1, \ldots, J_1$, are the eigenvalues of
$\mathbf{C}_s$ in descending order. Like PCA, the components now maximize the
output variance, but under the additional constraint (12.77).
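
A batch computation of the constrained solution can be sketched as follows,
assuming $\mathbf{Q}$ has orthonormal columns and the skewed matrix is
$(\mathbf{I} - \mathbf{Q}\mathbf{Q}^T)\mathbf{C}$. The dimensions and the
random matrices are hypothetical. As an implementation choice, the symmetric
matrix $\mathbf{P}\mathbf{C}\mathbf{P}$ with
$\mathbf{P} = \mathbf{I} - \mathbf{Q}\mathbf{Q}^T$ is decomposed instead,
since it shares its nonzero eigenpairs with the skewed matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
J1, l, J2 = 6, 2, 2                        # hypothetical dimensions with l + J2 < J1
A = rng.standard_normal((J1, J1))
C = A @ A.T / J1                           # stand-in autocorrelation matrix
Q, _ = np.linalg.qr(rng.standard_normal((J1, l)))   # orthonormal basis of the undesired subspace

P = np.eye(J1) - Q @ Q.T                   # projector onto the orthogonal complement of span(Q)
C_s = P @ C                                # skewed autocorrelation matrix

# P C P is symmetric, so a symmetric eigensolver can be used.
evals, evecs = np.linalg.eigh(P @ C @ P)
order = np.argsort(evals)[::-1]
W = evecs[:, order[:J2]]                   # principal eigenvectors as columns
top = evals[order[:J2]]
```

The columns of W are orthonormal, orthogonal to the undesired subspace, and
eigenvectors of the skewed matrix, which is what the constraint requires.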

PCA usually obtains the best fixed-rank approximation to the data in the LS
sense. On the other hand, constrained PCA allows specifying metric matrices
that modulate the effects of rows and columns of a data matrix. This actually
is weighted LS estimation. Constrained PCA first decomposes the data matrix
by projecting it onto the spaces spanned by matrices of external information
and then applies PCA to the decomposed matrices, which involves generalized
SVD. APEX can be applied to recursively solve the constrained PCA problem [73].

Given a sample covariance matrix, we examine the problem of maximizing the 
variance accounted for by a linear combination of the input variables while con- 
straining the number of nonzero coefficients in this combination. This is known 
as sparse PCA. The problem is to find sparse factors that account for a maxi- 
mum amount of variance. A semidefinite relaxation to this problem is formulated 
in [21] and a greedy algorithm that computes a full set of good solutions for all 
target numbers of nonzero coefficients is derived. 


Sparse PCA 


One would like to express as much variability in the data as possible, using 
components constructed from as few variables as possible. There are two kinds 
of sparse PCA: sparse loading PCA and sparse variable PCA. 

Sparse variable PCA removes some measured variables completely by simul- 
taneously zeroing out all their loadings. In [58] a sparse variable PCA method 
that is based on selecting a subset of measured variables with largest sample 
variances and then performing a PCA on the selected subset, is given. Sparse 
variable PCA is capable of huge additional dimension reduction beyond PCA. 
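
The variance-based selection can be sketched as follows. The data, the number
of retained variables and the variable names are assumptions, and the sketch
is not claimed to reproduce the procedure of [58] in detail.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, keep, k = 100, 20, 5, 2              # hypothetical sizes
scales = np.ones(p)
scales[:keep] = 5.0                        # only the first 5 variables have large variance
X = rng.standard_normal((n, p)) * scales

# Step 1: keep the variables with the largest sample variances.
variances = X.var(axis=0)
selected = np.sort(np.argsort(variances)[::-1][:keep])

# Step 2: ordinary PCA on the selected subset.
Xs = X[:, selected] - X[:, selected].mean(axis=0)
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)

# Embed the loadings back into p dimensions; discarded variables get zero loadings.
loadings = np.zeros((p, k))
loadings[selected, :] = Vt[:k].T
```

Every component has nonzero loadings only on the selected variables, so the
discarded variables need not be measured at all.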

Sparse loading PCA focuses on zeroing out individual PCA loadings but keeps
all the variables. One can simply set to zero the PCA loadings that are smaller
in absolute value than some threshold constant [7]. SCoTLASS [59] directly
puts $L_1$ constraints on the PCA loadings. A greedy sparse loading PCA
algorithm is developed in [21].

In an $L_1$ penalized likelihood approach to sparse variable PCA [128], a smooth
approximation of the $L_1$ penalty is used and the optimization is carried out
by geodesic steepest descent on a Stiefel manifold. A vector $L_0$ penalized
likelihood approach to sparse variable PCA is considered in [129]; the proposed
penalized EM algorithm, in an $L_0$ setting, leads to a closed-form M-step, and
a convergence analysis is provided. Thus one does not need to approximate the
vector $L_0$ penalty and does not need to use Stiefel optimization.

In an approach to sparse PCA [60], two single-unit and two block optimization 
formulations of the sparse PCA problem are proposed, aimed at extracting a sin- 
gle sparse dominant principal component of a data matrix, or more components 
at once, respectively. The dimension of the search space is decreased enormously 
if the data matrix has many more columns (variables) than rows. 

Sparse solutions to a generalized EVD problem are obtained by solving the
generalized EVD problem while constraining the cardinality of the solution.
Instead of relaxing the cardinality constraint using an $L_1$-norm
approximation, a tighter
approximation that is related to the negative log-likelihood of a Student-t dis- 
tribution is considered [122]. The problem is solved as a sequence of convex pro- 
grams by invoking the majorization-minimization method. The resulting algo- 
rithm is proved to exhibit global convergence behavior. Three specific examples 
of sparse generalized EVD problems are sparse PCA, sparse CCA and sparse 
LDA. The majorization-minimization method can be thought of as a generaliza- 
tion of the EM algorithm. 

Compressive-projection PCA [36] is driven by projections at the sensor onto
lower-dimensional subspaces chosen at random, while the decoder, given only
these random projections, recovers not only the coefficients associated with
PCA, but also an approximation to the PCA basis itself. This makes possible an
excellent dimension-reduction performance in a light-encoder/heavy-decoder
system architecture, particularly in satellite-borne remote-sensing
applications.


Localized PCA, incremental PCA and supervised PCA 


Localized PCA 
The nonlinear PCA problem can be overcome using localized PCA. The data
space is partitioned into a number of disjoint regions, followed by the
estimation of the principal subspace within each partition by linear PCA. The
distribution is collectively modeled by a collection or a mixture of linear PCA
models, each characterizing a partition. Localized PCA is different from local
PCA. In local PCA, the update at each node makes use of only local information.
Localized PCA provides an efficient means to decompose high-dimensional
data-compression problems into low-dimensional ones.
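
The partition-then-PCA idea can be sketched on a toy data set made of two line
segments with different orientations. The data, the use of a few Lloyd
(k-means) iterations for the partition, and the helper name pca_reconstruct
are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two line segments: globally nonlinear, locally linear.
s = rng.uniform(-1.0, 1.0, size=(150, 1))
seg_a = s * np.array([1.0, 0.2]) + np.array([-4.0, 0.0])
seg_b = s * np.array([0.2, 1.0]) + np.array([4.0, 0.0])
X = np.vstack([seg_a, seg_b])

def pca_reconstruct(data, k):
    # Project onto the k leading principal directions and map back.
    mean = data.mean(axis=0)
    _, _, Vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean + (data - mean) @ Vt[:k].T @ Vt[:k]

# Partition the data space with a few Lloyd iterations, started from two data points.
centers = X[[0, -1]].copy()
for _ in range(10):
    sq_dist = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
    labels = np.argmin(sq_dist, axis=1)
    centers = np.array([X[labels == c].mean(axis=0) for c in range(2)])

# One-dimensional linear PCA inside each region.
local = np.empty_like(X)
for c in range(2):
    local[labels == c] = pca_reconstruct(X[labels == c], 1)

global_mse = np.mean((X - pca_reconstruct(X, 1)) ** 2)
local_mse = np.mean((X - local) ** 2)
```

A single global principal direction cannot follow both segments, whereas one
direction per region reconstructs this noise-free data essentially exactly.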

VQ-PCA [62] is a locally linear model that uses vector quantization to define
the Voronoi regions for localized PCA. The algorithm builds a piecewise linear
model of the data. It performs better than the global models implemented by
the linear PCA model and Kramer’s nonlinear PCA, and is significantly faster
than Kramer’s nonlinear PCA. The localized PCA method is commonly used in
image compression. An image is often transform-coded by PCA, followed
by coefficient quantization.

An online localized PCA algorithm [91] is developed by extending the neu- 
ral gas method. Instead of the Euclidean distance measure, a combination of a 
normalized Mahalanobis distance and the squared reconstruction error guides 
the competition between the units. The unit centers are updated as in neural 
gas, while subspace learning is based on the robust RLS algorithm. ASSOM is 
another localized PCA for unsupervised extraction of invariant local features 
from the input data. It associates a subspace instead of a single weight vector to 
each node of SOM. 


Incremental PCA 

The incremental PCA algorithm [42] can update eigenvectors and eigenvalues
incrementally. It is applied to a single training sample at a time, and the
intermediate eigenproblem must be solved repeatedly for every training sample.
Chunk incremental PCA [104] processes a chunk of training samples at a time. It
can reduce the training time effectively as compared with incremental PCA unless
the number of input attributes is too large. It can obtain major eigenvectors
with fairly good approximation. In chunk incremental PCA, the update of an
eigenspace is completed by performing a single eigenvalue decomposition. The
SVD updating based incremental PCA algorithm [155] gives a close approximation
to the batch-mode PCA method, and the approximation error is proved to be
bounded.

Candid covariance-free IPCA [134] is a fast incremental PCA algorithm used 
to compute the principal components of a sequence of samples incrementally 
without estimating the covariance matrix. It is motivated by the concept of 
statistical efficiency (the estimate has the smallest variance given the observed 
data). Some links between incremental PCA and the development of the cerebral 
cortex are discussed in [134]. 
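
A covariance-free update for the first principal component can be sketched as
follows. The recursion used,
$\mathbf{v}(n) = \frac{n-1}{n}\mathbf{v}(n-1) + \frac{1}{n}\,\mathbf{x}(n)\,\mathbf{x}^T(n)\,\mathbf{v}(n-1)/\|\mathbf{v}(n-1)\|$,
is the commonly cited form without an amnesic parameter; it is stated here as
an assumption and is not taken from this text. The data are synthetic and
zero-mean.

```python
import numpy as np

rng = np.random.default_rng(8)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
std = np.array([3.0, 1.0, 0.5])
X = (rng.standard_normal((5000, 3)) * std) @ U.T   # zero-mean samples, top direction U[:, 0]

v = X[0].copy()                            # initialize with the first sample
for n, x in enumerate(X[1:], start=2):
    # Running average of x (x^T v) / ||v||; no covariance matrix is formed.
    v = (n - 1) / n * v + (1.0 / n) * x * (x @ v) / np.linalg.norm(v)

direction = v / np.linalg.norm(v)          # estimate of the first eigenvector
eigenvalue_estimate = np.linalg.norm(v)    # estimate of the largest eigenvalue
alignment = abs(direction @ U[:, 0])
```

Each sample is used once and then discarded, so the memory cost is that of a
single vector per extracted component.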

In a probabilistic online algorithm for PCA [131], in each trial the current 
instance is centered and projected onto a probabilistically chosen low dimensional 
subspace. The total expected quadratic compression loss of the online algorithm 
minus the total quadratic compression loss of the batch algorithm is bounded by 
a term whose dependence on the dimension of the instances is only logarithmic. 
The running time is $O(n^2)$ per trial, where $n$ is the dimension of the instances.


Other PCA methods 

Like supervised clustering, supervised PCA [13] is achieved by augmenting the
input of PCA with the class label of the dataset. Class-augmented PCA [105] is
a supervised feature extraction method; it is composed of processes for encoding
the class information, augmenting the data with the encoded information, and
extracting features from the class-augmented data by applying PCA.

PCA can be generalized to distributions of the exponential family [19]. This
generalization is based on a generalized linear model and criterion functions 
using the Bregman distance. This approach permits hybrid dimension reduction 
in which different distributions are used for different attributes of the data. 


Complex-valued PCA 


Complex PCA is a generalization of PCA to complex-valued data sets [50]. It has
been widely applied to complex-valued data and two-dimensional vector fields.
Complex PCA employs the same neural network architecture as that for PCA,
but with complex weights. The objective functions for PCA can also be adapted
to complex PCA by changing the transpose into the Hermitian transpose. For
example, for complex PCA, one can minimize the MSE function

$$E = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{z}_i - \mathbf{W}\mathbf{W}^H \mathbf{z}_i \right\|^2, \tag{12.81}$$

where $\mathbf{z}_i$, $i = 1, \ldots, N$, are the input complex vectors. By
minimizing (12.81), the first complex principal component is extracted.
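
The MSE criterion can be evaluated in batch form by means of a Hermitian
eigendecomposition. The sketch below uses synthetic complex data and
hypothetical sizes, and takes $\mathbf{W}$ to hold the $k$ principal
eigenvectors of the sample matrix $\frac{1}{N}\sum_i \mathbf{z}_i\mathbf{z}_i^H$
as columns.

```python
import numpy as np

rng = np.random.default_rng(9)
N, dim, k = 500, 4, 2                      # hypothetical sizes
mix = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
noise = rng.standard_normal((N, dim)) + 1j * rng.standard_normal((N, dim))
Z = noise @ mix.T                          # row i holds the sample z_i

C = Z.T @ Z.conj() / N                     # sample estimate of E[z z^H]
evals, evecs = np.linalg.eigh(C)           # Hermitian eigendecomposition, ascending order
W = evecs[:, ::-1][:, :k]                  # k principal eigenvectors as columns

residual = Z.T - W @ (W.conj().T @ Z.T)    # columns are z_i - W W^H z_i
E = np.mean(np.sum(np.abs(residual) ** 2, axis=0))
```

With this choice of $\mathbf{W}$, the value of the criterion equals the sum of
the discarded eigenvalues, which gives a simple consistency check.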

Complex-domain GHA [152] extends GHA for complex principal component
extraction. Complex-domain GHA is very similar to GHA except that complex
notations are introduced. The updating rule for $\mathbf{w}_j$ is [152]:

$$\mathbf{w}_j(n+1) = \mathbf{w}_j(n) + \mu(n)\,\mathrm{conj}[y_j(n)] \Big[ \mathbf{x}(n) - y_j(n)\mathbf{w}_j(n) - \sum_{i<j} y_i(n)\mathbf{w}_i(n) \Big], \tag{12.82}$$

$$y_j(n) = \mathbf{w}_j^H(n)\,\mathbf{x}(n), \tag{12.83}$$

where $H$ denotes the Hermitian transpose. With any initial $\mathbf{w}_j$, it
is proved to converge to the $j$th normalized eigenvector of
$\mathbf{C} = E[\mathbf{x}\mathbf{x}^H]$.

In [111], a complex-valued neural network model is developed for nonlinear 
complex PCA. Nonlinear complex PCA has the ability to extract nonlinear 
features missed by PCA. It uses the architecture of Kramer’s nonlinear PCA 
network, but with complex weights and biases. For a similar number of model 
parameters, it captures more variance of a data set than the alternative real 
approach, where each complex variable is replaced by two real variables and 
applied to Kramer’s nonlinear PCA. The complex hyperbolic tangent tanh(z) 
with |z| < 4 is selected as the transfer function. Complex-valued BP or quasi- 
Newton method can be used for training. 

PAST and PASTd are, respectively, PSA and PCA algorithms derived for
complex-valued signals [141]. Complex-valued APEX [17] allows extracting a
number of principal components from a complex-valued signal. Robust complex PCA
algorithms have also been derived in [18] for hierarchically extracting
principal components of complex-valued signals based on a robust statistics
based loss function.

As far as complex MCA is concerned, the constrained anti-Hebbian learning 
algorithm [38, 39] has been extended for the complex-valued TLS problem [38]. 


Two-dimensional PCA 


Because of the small-sample-size problem for image representation, PCA is prone 
to be overfitted to the training set. In PCA, an m x n image X should be mapped 
into a high-dimensional mn x 1 vector x in advance. Two-dimensional PCA can 
address these problems. In two-dimensional PCA, an image covariance matrix is
constructed directly using the original image matrices instead of the transformed 
vectors, and its eigenvectors are derived for image-feature extraction. 

For m x n images, the size of the image covariance (scatter) matrix using 
2DPCA [143] is n x n, whereas for PCA the size is mn x mn. 2DPCA evaluates 
the covariance matrix more accurately than PCA does. 2DPCA is a row-based 
PCA, and it only reflects the information between rows. It treats an image as 
m row vectors of dimension 1 x n and performs PCA on all row vectors in the 
training set. In 2DPCA, the actual vector dimension is n and the actual sample 
size is mN, where n < mN. Thus, the small-sample-size problem is resolved. 
Despite its advantages, 2DPCA still suffers from the high feature dimension 
problem. 
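
The construction of the image covariance matrix and of the feature matrices can
be sketched as follows. The image set is random, the sizes and the number of
retained axes are hypothetical, and the $1/N$ normalization is an assumption.

```python
import numpy as np

rng = np.random.default_rng(10)
N, m, n, d = 20, 12, 10, 3                 # N images of size m x n, d projection axes
images = rng.standard_normal((N, m, n))

mean_image = images.mean(axis=0)
centered = images - mean_image

# Image covariance (scatter) matrix: n x n, built from the image matrices directly
# as the average of A^T A over the centered images A.
G = np.einsum('kij,kil->jl', centered, centered) / N

evals, evecs = np.linalg.eigh(G)
X_proj = evecs[:, ::-1][:, :d]             # n x d projection axes
features = images @ X_proj                 # each image yields an m x d feature matrix

# Row-based view: scatter of the m*N row vectors after removing the mean image.
rows = centered.reshape(N * m, n)
G_rows = rows.T @ rows / N
```

The feature matrix of each image has $m \times d$ entries, which illustrates
why the feature dimension of 2DPCA remains high.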

Diagonal PCA [153] improves 2DPCA by defining the image scatter matrix as
the covariances between the variations of the rows and those of the columns of
the images, and is more accurate than PCA and 2DPCA. (PC)$^2$A [135] adopts
image preprocessing plus PCA.

In modular PCA [41], an image is divided into $n_1$ subimages and PCA is
performed on all these subimages. Since modular PCA divides an image into 
a number of subimages, the actual vector dimension in modular PCA will be 
much lower than in PCA. The number of training vectors used in modular PCA 
is much higher than the number used in PCA. Thus, modular PCA can be used 
to solve the overfitting problem. The feature dimension increases as the number 
of subimages is increased. 

2DPCA and modular PCA both alleviate the overfitting problem by reducing the 
vector dimension and increasing the number of training vectors, but both 
introduce the high feature dimension problem. 

Bidirectional PCA [157] reduces the dimension in both column and row directions 
for image feature extraction. The feature dimension of BD-PCA is much 
less than that of 2DPCA. Bidirectional PCA is a straightforward image projection 
technique where a $k_{\mathrm{col}} \times k_{\mathrm{row}}$ feature matrix $Y$ of an m x n image 
$X$ ($k_{\mathrm{col}} \ll m$, $k_{\mathrm{row}} \ll n$) can be obtained by 

$$Y = W_{\mathrm{col}}^T X W_{\mathrm{row}}, \qquad (12.84)$$

where $W_{\mathrm{col}}$ is the column projector and $W_{\mathrm{row}}$ is the row projector. 2DPCA can 
be regarded as a special bidirectional PCA with $W_{\mathrm{col}}$ being an m x m identity 
matrix. Bidirectional PCA has to be performed in batch mode. The scatter 
matrices of bidirectional PCA are formulated as the sum of K (sample size) 
image covariance matrices, making incremental learning directly on the scatters 
impossible. With the concepts of tensor, k-mode unfolding and matricization, 
an SVD-revision based incremental learning method of bidirectional PCA [112] 
gives a close approximation to bidirectional PCA, but using less time. 
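
The projection in (12.84) can be illustrated with a short batch-mode sketch. The way the two projectors are obtained here (leading eigenvectors of a column scatter and a row scatter computed from the centered images) is an assumption made for illustration, as are the function name `bdpca_projectors` and the toy data; the source only specifies the projection itself.

```python
import numpy as np

def bdpca_projectors(images, k_col, k_row):
    """Return (W_col, W_row) of sizes m x k_col and n x k_row for images of shape (N, m, n)."""
    centered = images - images.mean(axis=0)
    N = images.shape[0]
    # Assumed scatter matrices: sums of X^T X (n x n) and X X^T (m x m) over centered images.
    S_row = np.einsum('kij,kil->jl', centered, centered) / N
    S_col = np.einsum('kij,klj->il', centered, centered) / N

    def leading(S, k):
        _, vecs = np.linalg.eigh(S)       # ascending eigenvalues
        return vecs[:, ::-1][:, :k]

    return leading(S_col, k_col), leading(S_row, k_row)

rng = np.random.default_rng(0)
images = rng.normal(size=(30, 10, 8))     # N = 30 toy images with m = 10, n = 8
W_col, W_row = bdpca_projectors(images, k_col=4, k_row=3)
Y = W_col.T @ images[0] @ W_row           # k_col x k_row feature matrix, as in (12.84)
```

Replacing `W_col` by the m x m identity matrix gives the m x k_row features of 2DPCA, which is the special case mentioned above.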

PCA-L1 [75] is a fast and robust $L_1$-norm based PCA method. $L_1$-norm based 
two-dimensional PCA (2DPCA-L1) [78] is a two-dimensional generalization of 
PCA-L1 [75]. It avoids computation of the eigendecomposition process and its 
iteration step is easy to perform. The generalized low-rank approximation of 
matrices (GLRAM) [146] is another two-dimensional PCA method. 

The uncorrelated multilinear PCA algorithm [85] is used for unsupervised sub- 
space learning of tensorial data. It is a multilinear extension of PCA. Through 
successive variance maximization, uncorrelated multilinear PCA seeks a tensor- 
to-vector projection that captures most of the variation in the original tensorial 
input while producing uncorrelated features. This work offers a way to systemat- 
ically determine the maximum number of uncorrelated multilinear features that 
can be extracted by the method. The method not only obtains features that max- 
imize the variance captured, but also enforces a zero-correlation constraint, thus 
extracting uncorrelated features. It is the only multilinear extension of PCA that 
can produce uncorrelated features in a fashion similar to that of PCA, in con- 
trast to other multilinear PCA extensions, such as 2DPCA [143] and multilinear 
PCA (MPCA) [84]. 


Generalized eigenvalue decomposition 


Generalized EVD is a statistical tool that is extremely useful in feature 
extraction, pattern recognition as well as signal estimation and detection. The 
generalized EVD problem is to find a pair $(\lambda, x)$ such that 

$$R_1 x = \lambda R_2 x, \qquad (12.85)$$

where $R_1 \in \mathbb{R}^{n \times n}$, $R_2 \in \mathbb{R}^{n \times n}$, $\lambda \in \mathbb{R}$. Generalized EVD aims to find multiple 
principal or minor generalized eigenvectors of a positive-definite symmetric 
matrix pencil $(R_1, R_2)$. PCA, CCA and LDA are specific instances of generalized 
EVD problems. 

The generalized EVD problem involves the matrix equation 

$$R_1 w_i = \lambda_i R_2 w_i, \qquad (12.86)$$

where $R_1, R_2 \in \mathbb{R}^{J_1 \times J_1}$, and $\lambda_i$, $w_i$, $i = 1, \ldots, J_2$, are, respectively, the $i$th 
generalized eigenvalue and its corresponding generalized eigenvector. For real 
symmetric and positive-definite matrices, all the generalized eigenvectors are real 
and the corresponding generalized eigenvalues are positive. 

Generalized EVD achieves simultaneous diagonalization of $R_1$ and $R_2$: 

$$W^T R_1 W = \Lambda, \quad W^T R_2 W = I, \qquad (12.87)$$

where $W = [w_1, \ldots, w_{J_2}]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{J_2})$. Typically, $R_1$ and $R_2$ are, 
respectively, the full covariance matrices of zero-mean stationary random signals 
$x_1, x_2 \in \mathbb{R}^{J_1}$. In this case, iterative generalized EVD algorithms can be obtained 
by using two PCA steps. Generalized EVD is also referred to as oriented 
PCA [25]. When $R_2$ becomes an identity matrix, generalized EVD reduces 
to PCA. 
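
The two-step idea can be made concrete with a batch sketch: one EVD whitens with respect to the second matrix, and a second EVD diagonalizes the whitened first matrix. This is only an illustration of the simultaneous diagonalization, not one of the adaptive algorithms cited in this section; the function name `gevd_two_step` and the random test matrices are assumptions.

```python
import numpy as np

def gevd_two_step(R1, R2):
    """Generalized EVD of a symmetric positive-definite pencil (R1, R2) via two EVD steps."""
    # Step 1: EVD of R2 and whitening, so that T^T R2 T = I.
    d2, V2 = np.linalg.eigh(R2)
    T = V2 / np.sqrt(d2)                  # scales column j of V2 by d2[j]^(-1/2)
    # Step 2: EVD of the whitened R1.
    lam, V1 = np.linalg.eigh(T.T @ R1 @ T)
    order = np.argsort(lam)[::-1]         # principal components first
    return lam[order], (T @ V1)[:, order]

rng = np.random.default_rng(0)
B1 = rng.normal(size=(200, 4))
B2 = rng.normal(size=(200, 4))
R1 = B1.T @ B1 / 200                      # sample covariance of the first signal
R2 = B2.T @ B2 / 200                      # sample covariance of the second signal
lam, W = gevd_two_step(R1, R2)
```

The columns of `W` are normalized with respect to `R2` rather than to unit Euclidean length, so they are in general not mutually orthogonal.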

Any generalized eigenvector $w_i$ is a stationary point of the criterion function 

$$E_{\mathrm{GEVD}}(w) = \frac{w^T R_1 w}{w^T R_2 w}. \qquad (12.88)$$

The LDA problem is a typical generalized EVD problem. The three-layer 
LDA network [87] is obtained by the concatenation of two Rubner-Tavan PCA 
subnetworks. Each subnetwork is trained by the Rubner-Tavan PCA algorithm 
[116, 115]. Based on the Rubner-Tavan PCA network architecture, online local 
learning algorithms for LDA and generalized EVD are given in [22]. 

Generalized EVD methods for extracting multiple principal generalized 
eigenvectors from two sequences of sample vectors are typically adaptive ones for 
online implementation. These include the LDA-based gradient-descent algorithm 
[10, 136], a gradient-based adaptive algorithm for estimating the largest 
principal generalized eigenvector [94], a quasi-Newton type generalized EVD algorithm 
[88, 144, 145], an RLS-like fixed-point generalized EVD algorithm [110], 
error-correction learning [22], and Hebbian learning [22]. Fixed-point algorithms do not 
require any external step-size parameters like the gradient-based methods. These 
algorithms may be sequential algorithms or parallel ones. As in the case of PCA 
algorithms, sequential algorithms also use a deflation procedure. This causes error 
propagation, leading to slow convergence of minor generalized eigenvectors. 

Implementation of generalized EVD algorithms can employ a neural network 
architecture, such as a two-layer linear heteroassociative network [10] or a lateral 
inhibition network [136]. A recurrent network with invariant B-norm proposed in 
[125] computes the largest or smallest generalized eigenvalue and the correspond- 
ing eigenvector of any symmetric positive pair, which can be simply extended 
to compute the second largest or smallest generalized eigenvalue and the corre- 
sponding eigenvector. 


Singular value decomposition 


SVD is among the most important tools in numerical analysis for solving a wide 
scope of approximation problems in signal processing, model reduction and data 
compression. The crosscorrelation asymmetric PCA/MCA networks can be used 
to extract the singular values of the crosscorrelation matrix of two stochastic 
signal vectors, or to implement SVD of a general matrix. 

The SVD updating algorithm [76] provides an efficient way to carry out SVD 
of a larger matrix. By exploiting the orthonormal properties and block structure, 
the SVD computation of [A, B] can be efficiently carried out by using the smaller 
matrices and SVD of the smaller matrix. 

Tucker decomposition [127] decomposes a three-dimensional signal directly 
using three-dimensional PCA, which is a multilinear generalization of SVD to 
multidimensional data. For video frames, this higher-order SVD decomposes the 
dynamic texture as a multidimensional signal (tensor) without unfolding the 
video frames into column vectors. This is a more natural and flexible 
decomposition, since it permits dimension reduction in the spatial, temporal, 
and chromatic domains of the video sequence, leading to 
an important decrease in model size, whereas standard SVD allows for temporal 
reduction only. For comparable synthesis quality, higher-order SVD requires, on 
average, five times fewer parameters than standard SVD. The analysis part is more 
expensive, but the synthesis has the same cost as that in the existing algorithms 
[20]. 
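
A truncated higher-order SVD of a three-way array can be sketched as below. This is a generic illustration and not the dynamic-texture method of [20]: the helper names (`unfold`, `mode_product`, `hosvd`, `reconstruct`), the toy tensor and the chosen ranks are assumptions. Each mode is unfolded, the leading left singular vectors are kept as the factor for that mode, and the core tensor is obtained by projecting onto these factors.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: the chosen mode indexes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    out = np.tensordot(M, T, axes=(1, mode))   # contracted mode becomes axis 0
    return np.moveaxis(out, 0, mode)

def hosvd(T, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors

def reconstruct(core, factors):
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

rng = np.random.default_rng(0)
video = rng.normal(size=(6, 5, 4))             # toy tensor: height x width x frames
core, factors = hosvd(video, ranks=(3, 3, 2))  # reduction along every mode
approx = reconstruct(core, factors)
```

Truncating the ranks in all three modes is what allows reduction in the spatial as well as the temporal direction, whereas an SVD of the frame-by-pixel matrix only truncates one of them.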


Crosscorrelation asymmetric PCA networks 


Given two sets of random vectors with zero mean, $x_t \in \mathbb{R}^{n_1}$ and $y_t \in \mathbb{R}^{n_2}$, the 
crosscorrelation matrix is defined by 

$$C_{xy} = E\left[x_t y_t^T\right] = \sum_{i=1}^{n} \sigma_i v_i^x (v_i^y)^T, \qquad (12.89)$$

where $\sigma_i > 0$ is the $i$th singular value, $v_i^x$ and $v_i^y$ are its corresponding left and 
right singular vectors, and $n = \min\{n_1, n_2\}$. The crosscorrelation asymmetric 
PCA network is a method for extracting multiple principal singular components 
of $C_{xy}$. 

The crosscorrelation asymmetric PCA network consists of two sets of neurons 
that are laterally hierarchically connected [23]. The asymmetric PCA network, 
shown in Fig. 12.13, is composed of two hierarchical PCA networks. $x \in \mathbb{R}^{n_1}$ and 
$y \in \mathbb{R}^{n_2}$ are input vectors, and $a, b \in \mathbb{R}^{m}$ are the output vectors of the hidden layers. 
The $n_1 \times m$ matrix $W = [w_1 \ldots w_m]$ and the $n_2 \times m$ matrix $\bar{W} = [\bar{w}_1 \ldots \bar{w}_m]$ 
are the feedforward weights, while $U = [u_1 \ldots u_m]$ and $\bar{U} = [\bar{u}_1 \ldots \bar{u}_m]$ are the 
$m \times m$ matrices of lateral connection weights, where $u_i = (u_{1i}, \ldots, u_{mi})^T$, 
$\bar{u}_i = (\bar{u}_{1i}, \ldots, \bar{u}_{mi})^T$, and $m \le \min\{n_1, n_2\}$. This model performs SVD of $C_{xy}$. 
The network has the following relations: 

$$a = W^T x, \quad b = \bar{W}^T y, \qquad (12.90)$$

where $a = (a_1, \ldots, a_m)^T$ and $b = (b_1, \ldots, b_m)^T$. 

Figure 12.13 Architecture of the crosscorrelation asymmetric PCA network. 


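The objective function for extracting the first principal singular value of the 
crosscorrelation matrix is given by 

$$E_{\mathrm{APCA}}(w, \bar{w}) = \frac{E[a_1(t) b_1(t)]}{\|w\| \, \|\bar{w}\|} = \frac{w^T C_{xy} \bar{w}}{\|w\| \, \|\bar{w}\|}. \qquad (12.91)$$

It is an indefinite function. When $y = x$, it reduces to PCA. After the principal 
singular component is extracted, a deflation transformation is introduced to 
nullify the principal singular value so as to make the next singular value principal. 
Thus, $C_{xy}$ in the criterion (12.91) can be replaced by one of the following three 
transformed forms so as to extract the $(i+1)$th principal singular component: 

$$C_{xy}^{(i+1)} = C_{xy}^{(i)} \left(I - v_i^y (v_i^y)^T\right), \qquad (12.92)$$

$$C_{xy}^{(i+1)} = \left(I - v_i^x (v_i^x)^T\right) C_{xy}^{(i)}, \qquad (12.93)$$

$$C_{xy}^{(i+1)} = \left(I - v_i^x (v_i^x)^T\right) C_{xy}^{(i)} \left(I - v_i^y (v_i^y)^T\right), \qquad (12.94)$$

for $i = 1, \ldots, m-1$, where $C_{xy}^{(1)} = C_{xy}$. These are, respectively, obtained by the 
transforms on the data: 

$$x \leftarrow x, \quad y \leftarrow y - v_i^y (v_i^y)^T y, \qquad (12.95)$$

$$x \leftarrow x - v_i^x (v_i^x)^T x, \quad y \leftarrow y, \qquad (12.96)$$

$$x \leftarrow x - v_i^x (v_i^x)^T x, \quad y \leftarrow y - v_i^y (v_i^y)^T y. \qquad (12.97)$$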

Using a deflation transformation, the two sets of neurons are trained with the 
cross-coupled Hebbian learning rules, which, for $j = 1, \ldots, m$, are given by [23] 

$$w_j(t+1) = w_j(t) + \eta \left[x(t) - w_j(t) a_j(t)\right] b'_j(t), \qquad (12.98)$$

$$\bar{w}_j(t+1) = \bar{w}_j(t) + \eta \left[y(t) - \bar{w}_j(t) b_j(t)\right] a'_j(t), \qquad (12.99)$$

where the learning rate $\eta > 0$ is selected as a small constant or according to the 
Robbins-Monro conditions, 

$$a'_j = a_j - \sum_{i=1}^{j-1} u_{ij} a_i, \quad b'_j = b_j - \sum_{i=1}^{j-1} \bar{u}_{ij} b_i, \qquad (12.100)$$

$$a_i = w_i^T x, \quad b_i = \bar{w}_i^T y, \quad i = 1, \ldots, j, \qquad (12.101)$$

and the lateral weights should be equal to 

$$u_{ij} = w_i^T w_j, \quad \bar{u}_{ij} = \bar{w}_i^T \bar{w}_j, \quad i = 1, \ldots, j-1. \qquad (12.102)$$

A set of lateral connections among the units is called a lateral orthogonalization 
network. Hence, $U$ and $\bar{U}$ are upper triangular matrices. The stability of the 
algorithm is proved based on Lyapunov's second theorem. 

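A local algorithm for calculating $u_{ij}$ and $\bar{u}_{ij}$ is derived by premultiplying 
(12.98) with $w_i^T$, premultiplying (12.99) with $\bar{w}_i^T$ and then employing (12.102): 

$$u_{ij}(t+1) = u_{ij}(t) + \eta \left[a_i(t) - u_{ij}(t) a_j(t)\right] b'_j(t), \qquad (12.103)$$

$$\bar{u}_{ij}(t+1) = \bar{u}_{ij}(t) + \eta \left[b_i(t) - \bar{u}_{ij}(t) b_j(t)\right] a'_j(t). \qquad (12.104)$$

We can select $u_{ij}(0) = w_i^T(0) w_j(0)$ and $\bar{u}_{ij}(0) = \bar{w}_i^T(0) \bar{w}_j(0)$. However, this 
initial condition is not critical to the convergence of the algorithm. 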

$w_i$ and $\bar{w}_i$ approximate the $i$th left and right principal singular vectors of $C_{xy}$, 
respectively, and $\sigma_i$ approximates its corresponding criterion $E_{\mathrm{APCA}}$, as $t \to \infty$; 
that is, the algorithm extracts the first $m$ principal singular values in descending 
order and their corresponding left and right singular vectors. Like APEX, 
asymmetric PCA incrementally adds nodes without retraining the learned nodes. 
Exponential convergence has been observed by simulation [23]. 

When $m$ in the asymmetric PCA network is selected as unity, the principal 
singular component of $C_{xy}$ can be extracted by modifying the cross-coupled 
Hebbian rule [29]: 

$$w(t+1) = w(t) + \eta \left[b(t) x(t) - \|\bar{w}(t)\|^2 w(t)\right], \qquad (12.105)$$

$$\bar{w}(t+1) = \bar{w}(t) + \eta \left[a(t) y(t) - \|w(t)\|^2 \bar{w}(t)\right]. \qquad (12.106)$$

This algorithm can efficiently extract the principal singular component, and is 
proved to have global asymptotic convergence. When $y(t) = x(t)$, it reduces to 
Oja’s PCA algorithm [95]. 


Extracting principal singular components for nonsquare matrices 


When the crosscorrelation matrix is replaced by a general nonsquare matrix, 
(12.105) and (12.106) can be directly transformed into the algorithm for 
extracting the principal singular component of a general matrix $A \in \mathbb{R}^{n_1 \times n_2}$ [29]: 

$$w(t+1) = w(t) + \eta \left[A \bar{w}(t) - \|\bar{w}(t)\|^2 w(t)\right], \qquad (12.107)$$

$$\bar{w}(t+1) = \bar{w}(t) + \eta \left[A^T w(t) - \|w(t)\|^2 \bar{w}(t)\right]. \qquad (12.108)$$

Using (12.107) and (12.108) and a deflation transformation, one can extract 
multiple principal singular components for the nonsquare matrix $A$ [29]: 

$$w_i(t+1) = w_i(t) + \eta_i \left[A_i \bar{w}_i(t) - \|\bar{w}_i(t)\|^2 w_i(t)\right], \quad i = 1, \ldots, m, \qquad (12.109)$$

$$\bar{w}_i(t+1) = \bar{w}_i(t) + \eta_i \left[A_i^T w_i(t) - \|w_i(t)\|^2 \bar{w}_i(t)\right], \quad i = 1, \ldots, m, \qquad (12.110)$$

$$A_{i+1} = A_i - w_i \bar{w}_i^T, \quad i = 1, \ldots, m-1, \qquad (12.111)$$

where $A_1 = A$, and an upper bound on the learning rates $\eta_i$ is suggested in [29]. 
As $t \to \infty$, $w_i(t)$ and $\bar{w}_i(t)$ represent the left and right singular vectors 
corresponding to the $i$th singular value, arranged in descending order 
$\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_m > 0$. 
The measures of stopping iteration can be given as 

$$e_i(t) = \left\| A_i \bar{w}_i(t) - \|\bar{w}_i(t)\|^2 w_i(t) \right\| + \left\| A_i^T w_i(t) - \|w_i(t)\|^2 \bar{w}_i(t) \right\| < \varepsilon, \quad i = 1, \ldots, m, \qquad (12.112)$$

where $\varepsilon$ is a small number such as $10^{-20}$. The algorithm is proved convergent. 

The algorithm can efficiently perform SVD of an ill-posed matrix. It can also be 
used to compute the smallest singular component of the general matrix A. Although 
the method is indirect for computing the smallest singular component of a 
nonsquare matrix, it is efficient and robust. The algorithm is particularly useful for 
TLS problems. 
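
The deflation-based extraction of several principal singular components can be sketched in batch form as follows. This is a minimal illustration that assumes the update has the gradient form of the rank-one approximation error of A (each vector is driven by the matrix applied to the other vector, with a decay scaled by the squared norm of the other vector). The function name, the constant learning rate `eta = 0.05`, the fixed iteration count (used instead of a stopping threshold) and the test matrix are all assumptions made for the example.

```python
import numpy as np

def principal_singular_components(A, m, eta=0.05, n_iter=5000, seed=0):
    """Extract m principal singular triplets of A by coupled updates plus deflation."""
    rng = np.random.default_rng(seed)
    A_i = np.array(A, dtype=float)
    sigmas, left, right = [], [], []
    for _ in range(m):
        w = 0.1 * rng.normal(size=A.shape[0])       # left weight vector
        w_bar = 0.1 * rng.normal(size=A.shape[1])   # right weight vector
        for _ in range(n_iter):
            # Both updates use the values from the previous iteration.
            w_new = w + eta * (A_i @ w_bar - (w_bar @ w_bar) * w)
            w_bar_new = w_bar + eta * (A_i.T @ w - (w @ w) * w_bar)
            w, w_bar = w_new, w_bar_new
        A_i = A_i - np.outer(w, w_bar)              # deflation step
        nw, nwb = np.linalg.norm(w), np.linalg.norm(w_bar)
        sigmas.append(nw * nwb)                     # singular value estimate
        left.append(w / nw)
        right.append(w_bar / nwb)
    return np.array(sigmas), np.column_stack(left), np.column_stack(right)

# Test matrix with known singular values 3, 2, 1, 0.5.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(5, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T
sigmas, W, W_bar = principal_singular_components(A, m=3)
```

At a fixed point the outer product of the two weight vectors equals the singular value times the outer product of the unit singular vectors, which is why subtracting it removes exactly the component just extracted. The learning rate has to be small relative to the largest singular value for the iteration to remain stable.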


Extracting multiple principal singular components 


The double generalized Hebbian algorithm [121] is derived from a two-fold opti- 
mization problem: the left singular vector estimate is adapted by GHA, whereas 
the right singular vector estimate is adapted by the Widrow-Hoff rule. The linear 
approximation asymmetric PCA network [24] or orthogonal asymmetric encoder 
[121] is a two-layer feedforward linear network. Training is performed by BP. A 
stochastic online algorithm has been suggested [121], [24]. The network has a 
bottleneck topology. The cross-associative network for single component learn- 
ing is given in [31], where fixed points are the first principal left and right sin- 
gular vectors of A. The cross-associative neural network [32] is derived from 
a non-quadratic objective function which incorporates a matrix logarithm for 
extracting multiple principal singular components. 

Coupled online learning rules [61] are derived for SVD of a cross-covariance 
matrix of two correlated data streams. The coupled SVD rule is derived by 
applying Newton’s method to an objective function which is neither subject to 
minimization nor to maximization. Newton’s method guarantees nearly equal 
convergence speed in all directions, independent of the singular structure of A, 
and turns the saddle point into an attractor [92]. The online learning rules resem- 
ble PCA rules [92] and the cross-coupled Hebbian rule [23]. 




A first-order approximation of GSO is used as a decorrelation method for 
estimation of multiple singular vectors and singular values. By inserting the 
first-order approximation or deflation, we can obtain the corresponding Hebbian 
SVD algorithms and the coupled SVD algorithms. 

Coupled learning rules for SVD produce better results than Hebbian learning 
rules. Combined with first-order approximation of GSO, precise estimates of sin- 
gular vectors and singular values with only small deviations from orthonormality 
are produced. Double deflation is clearly superior to standard deflation but infe- 
rior to first-order approximation of GSO, both with respect to orthonormality 
and diagonalization errors. Coupled learning rules converge faster than Hebbian 
learning rules, and the first-order approximation of GSO produces more precise 
estimates and better orthonormality than standard deflation [61]. Many SVD 
algorithms are reviewed in [61]. 


Canonical correlation analysis 


CCA [52], proposed by Hotelling in 1936, is a multivariate statistical technique. 
It makes use of two views of the same set of objects and projects them onto 
a lower-dimensional space in which they are maximally correlated. CCA seeks 
prominently correlated projections between two views of data and it has been 
long known to be equivalent to LDA when the data features are used in one view 
and the class labels are used in the other view [5], [46]. In other words, LDA is 
a special case of CCA. CCA is equivalent to LDA for binary-class problems [46], 
and it can be formulated as an LS problem for binary-class problems. 

CCA leads to a generalized EVD problem. Thus we can employ a kernelized 
version of CCA to compute a flexible contrast function for ICA. Generalized 
CCA consists of a generalization of CCA to more than two sets of variables [66]. 

Given two centered random multivariables $x \in \mathbb{R}^{n_x}$ and $y \in \mathbb{R}^{n_y}$, the goal of 
CCA is to find a pair of directions called canonical vectors $w_x$ and $w_y$ such that 
the correlation $\rho(x, y)$ between the two projections $w_x^T x$ and $w_y^T y$ is maximized. 

Suppose that we are given a sample of instances $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of 
$(x, y)$. Let $S_x$ denote $(x_1, \ldots, x_n)$ and similarly $S_y$ denote $(y_1, \ldots, y_n)$. We can 
consider defining a new coordinate for $x$ by choosing direction $w_x$ and projecting 
$x$ onto that direction, $x \to w_x^T x$. If we do the same for $y$ by choosing direction 
$w_y$, we obtain a sample of the new mapping for $y$. Let 

$$S_{x, w_x} = \left(w_x^T x_1, \ldots, w_x^T x_n\right), \qquad (12.113)$$

with the corresponding values of the mapping for $y$ being 

$$S_{y, w_y} = \left(w_y^T y_1, \ldots, w_y^T y_n\right). \qquad (12.114)$$
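The first stage of canonical correlation is to choose $w_x$ and $w_y$ to maximize the 
correlation between the two vectors 

$$\rho = \max_{w_x, w_y} \mathrm{corr}\left(S_{x, w_x}, S_{y, w_y}\right) = \max_{w_x, w_y} \frac{\langle S_{x, w_x}, S_{y, w_y} \rangle}{\|S_{x, w_x}\| \, \|S_{y, w_y}\|}. \qquad (12.115)$$

After manipulation, we have 

$$\rho = \max_{w_x, w_y} \frac{w_x^T E\left[x y^T\right] w_y}{\sqrt{w_x^T E\left[x x^T\right] w_x \, w_y^T E\left[y y^T\right] w_y}} = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \, w_y^T C_{yy} w_y}}, \qquad (12.116)$$

where the covariance matrix of $(x, y)$ is defined by 

$$C(x, y) = E\left[\begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^T\right] = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix} = C. \qquad (12.117)$$

The problem can be transformed into 

$$\max_{w_x, w_y} w_x^T C_{xy} w_y \qquad (12.118)$$

subject to 

$$w_x^T C_{xx} w_x = 1, \quad w_y^T C_{yy} w_y = 1. \qquad (12.119)$$

This optimization problem can be solved by a generalized eigenvalue problem: 

$$C_{xy} w_y = \mu C_{xx} w_x, \quad C_{yx} w_x = \nu C_{yy} w_y, \qquad (12.120)$$

where $\mu$ and $\nu$ are Lagrange multipliers. It can be derived that $w_x$ and $w_y$ are 
the eigenvectors of $C_x = C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}$ and $C_y = C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}$ 
corresponding to their largest eigenvalues, respectively. 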
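
The eigenvector characterization of the first canonical pair can be sketched with sample covariance matrices. This is a minimal illustration that assumes the covariance matrices of both views are invertible; the function name `cca_first_pair` and the synthetic two-view data (a shared latent variable `z` plus noise) are assumptions made for the example.

```python
import numpy as np

def cca_first_pair(X, Y):
    """First canonical vectors and correlation. X is (N, nx), Y is (N, ny); rows are samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    N = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N
    # Product of inverse covariances and cross-covariances for the x view;
    # its largest eigenvalue is the squared canonical correlation.
    Cx = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    evals, evecs = np.linalg.eig(Cx)
    k = np.argmax(evals.real)
    wx = np.real(evecs[:, k]).copy()
    wy = np.linalg.solve(Cyy, Cxy.T @ wx)     # direction for the y view
    wx /= np.sqrt(wx @ Cxx @ wx)              # unit-variance constraint on the x projection
    wy /= np.sqrt(wy @ Cyy @ wy)              # unit-variance constraint on the y projection
    return wx, wy, wx @ Cxy @ wy

rng = np.random.default_rng(0)
z = rng.normal(size=500)                      # latent variable shared by both views
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     rng.normal(size=500),
                     rng.normal(size=500)])
Y = np.column_stack([rng.normal(size=500),
                     -z + 0.1 * rng.normal(size=500)])
wx, wy, rho = cca_first_pair(X, Y)
```

Because the two projections have unit variance after normalization, the returned value `rho` coincides with their sample correlation.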

Under a mild condition which tends to hold for high-dimensional data, CCA 
in the multilabel case can be formulated as an LS problem [124]. Based on this, 
efficient algorithms for solving LS problems can be applied to scale CCA to 
very large data sets. In addition, several CCA extensions, including the sparse 
CCA formulation based on $L_1$-norm regularization, are proposed in [124]. The 
LS formulation of CCA and its extensions can be solved efficiently. The LS 
formulation is extended to orthonormalized partial least squares by establishing 
the equivalence relationship between CCA and orthonormalized partial least 
squares [124]. The CCA projection for one set of variables is independent of 
the regularization on the other set of multidimensional variables. 

In [74], a strategy for reducing LDA to CCA is proposed. Within-class coupling 
CCA (WCCCA) is to apply CCA to pairs of data samples that are most likely 
to belong to the same class. Each one of the samples of a class, serving as the 
first view, is paired with every other sample of that class serving as the second 
view. The equivalence between LDA and such an application of CCA is proved. 

Two-dimensional CCA seeks linear correlation based on images directly. Moti- 
vated by locality-preserving CCA [123] and spectral clustering, a manifold learn- 
ing method called local two-dimensional CCA [133] identifies the local correlation 
by weighting images differently according to their closeness. That is, the correla- 
tion is measured locally, which makes local two-dimensional CCA more accurate 
in finding correlative information. Local two-dimensional CCA is formulated as 
solving generalized eigenvalue equations tuned by Laplacian matrices. 

CCArc [70] is a two-dimensional CCA that represents the image 
as the sets of its rows and columns and implements CCA using these sets. 
CCArc does not require a preliminary downsampling procedure, is not iterative, 
and is applied along the rows and columns of the input image. The size of the 
covariance matrices in CCArc is equal to max{M, N}. The small-sample-size 
problem does not occur in CCArc, because we actually use N images of size M x 1 
and M images of size N x 1; this always meets the condition max{M, N} < (M + N). 

A method for solving CCA in a sparse convex framework [43] is proposed 
using an LS approach. Sparse CCA minimizes the number of features used in 
both the primal and dual projections while maximizing the correlation between 
the two views. When the number of the original features is large, sparse CCA 
outperforms kernel CCA, learning the common semantic space from a sparse 
set of features. Least-squares canonical dependency analysis [64] is an extension 
of CCA that can effectively capture complicated nonlinear correlations through 
maximization of the statistical dependency between two projected variables. 


12.1 Show that Oja’s algorithm is asymptotically stable. 


12.2 Show that the discrete-time Oja rule is a good approximation of the nor- 
malized Hebbian rule. 


12.3 Show that the average Hessian H(w) in (12.14) is positive-definite only 
at $w = c_1$. 





12.4 Consider the data generated by the autoregressive process 

$$x_k = 0.8 x_{k-1} + e_k,$$

where $e_k$ is a zero-mean uncorrelated Gaussian driving sequence with unit 
variance. The data points are arranged in blocks of size N = 6. Extract the first two 
minor components. 


12.5 The grayscale Lenna picture of 512 x 512 pixels is split into 
nonoverlapping 8 x 8 blocks. Each block constructs a 64-dimensional vector. The vectors 
are selected randomly to form an input sequence $x(k) \in \mathbb{R}^{64}$. Compute six principal 
directions and their direction cosines by using GHA with learning rates given by 
(12.46). Plot the reconstructed image and the picture SNR. 


12.6 Redo Example 12.1 by using APEX. 


12.7 Explain the function of the bottleneck layer in the five-layer autoassocia- 
tive neural network. 
12.8 Generate 400 observations of three variates $X_1$, $X_2$, $X_3$ according to 
$X_1 = Z_1$, $X_2 = X_1 + 0.1 Z_2$, $X_3 = 10 Z_3$, where $Z_1$, $Z_2$, $Z_3$ are independent standard 
normal variates. Compute and plot the leading principal component and factor 
analysis directions. 


12.9 In Example 12.4, PCA is used for image coding. Complete the example 
by quantizing each coefficient using a suitable number of bits. Calculate the 
compression ratio of the image. 


12.10 Given a data set generated by 


sin yi Ti ; 
i= > Ur= R =l +o, 100. 
Š E 7) 2 2n í 


a) For the samples, plot $x_2$ against $x_1$. 
b) Plot the discovered coordinates of $x_i$ obtained by CCA against $y_i$. 


Nonnegative matrix factorization 


Introduction 


Matrix factorization or factor analysis is an important task that is helpful in 
the analysis of high-dimensional real-world data. SVD is a classical method for 
matrix factorization, which gives the optimal low-rank approximation to a real- 
valued matrix in terms of the squared error. Many application areas, including 
information retrieval, pattern recognition and data mining, require processing of 
binary rather than real data. 

Many real-life data or physical signals, such as pixel intensities, amplitude 
spectra, text corpora, gene expressions, air quality, information and occurrence 
counts, are naturally represented by nonnegative numbers. In the analysis of 
mixtures of such data, nonnegativity of the individual components is a reasonable 
constraint. A variety of techniques are available for analysis of such data, such 
as nonnegative PCA, nonnegative ICA and nonnegative matrix factorization 
(NMF) [30]. The goal of all of these techniques is to express the given nonnegative 
data as a guaranteed nonnegative linear combination of a set of nonnegative 
bases. 

NMF [30], also known as nonnegative matrix approximation or positive matrix 
factorization [34], is an unsupervised learning method for factorizing a matrix as 
a product of two matrices, in which all the elements are nonnegative. In NMF, 
the nonnegative constraint prevents mutual cancellation between basis functions 
and yields parts-based representations. NMF has become an established method 
for performing tasks such as BSS of images and nonnegative signals [1], spectra 
recovery [37], feature extraction [30], dimension reduction, segmentation and 
clustering [10], language modeling, text mining, neurobiology (gene separation), 
and gene expression profiles. 

The NMF problem is described as follows. Given a nonnegative matrix X, find 
nonnegative matrix factors A and S such that the difference measure between 
X and AS is the minimum according to some cost function: 


X= AS, (13.1) 


where $X \in \mathbb{R}^{m \times n}$, the coefficient matrix $A \in \mathbb{R}^{m \times k}$, the matrix of sources 
$S \in \mathbb{R}^{k \times n}$, elements in both $A$ and $S$ are nonnegative, and the rows in $S$ may be 
statistically dependent to some extent. In other words, $A$ can be seen as a basis 
that is optimized for linear approximation of the data in $X$ by minimizing 
$E = \|X - AS\|^2$. It is usually selected that $km + kn < mn$ for data reduction. 

Figure 13.1 Nonuniqueness of NMF. 

NMF may not directly yield a unique decomposition [12]. It is characterized 
by scale and permutation indeterminacies. Although the uniqueness may be 
improved by imposing some constraints on the factors, it is still a challenging 
problem to uniquely identify the sources in general cases [26]. Figure 13.1 
illustrates the nonuniqueness problem in two dimensions. There is open space 
between the data points and the coordinate axes. We can choose the basis vectors 
$h_1$ and $h_2$ anywhere in this open space between the coordinate axes and data, 
and represent each data point exactly with a nonnegative linear combination of 
these vectors. Some well-posed NMF problems are obtained and the solutions 
are optimal and sparse under the separability assumption [17]. 

NMF and nonnegative tensor factorization decompose a nonnegative data 
matrix into a product of lower-rank nonnegative matrices or tensors. Boolean 
matrix factorization or Boolean factor analysis is the factorization of data sets 
in binary alphabet based on Boolean algebra [23]. Although both NMF and 
sparse coding learn sparse representation, they are different because NMF learns 
low-rank representation while sparse coding usually learns the full-rank repre- 
sentation. 


Algorithms for NMF 


NMF optimization problems are usually nonconvex. NMF is usually performed 
with an alternating gradient-descent technique that is applied to the squared 
Euclidean distance or Kullback-Leibler divergence. The two measures are unified 
by using the parameterized cost functions such as $\beta$-divergence [25] or a 
broader class called Bregman divergence [9]. This approach belongs to a class 
of multiplicative iterative algorithms [30]. In spite of low complexity, it con- 
verges slowly, gives only a strictly positive solution, and can easily fall into local 
minima of a nonconvex cost function. Another popular algorithm is alternating 
non-negative least squares (ANLS) [34]. 

An algorithm for NMF can be applied to BSS by adding two suitable regu- 
larization terms in the original objective function of NMF to increase sparseness 
and/or smoothness of the estimated components [5]. In a pattern-expression 
NMF approach [51] to BSS, two regularization terms are added to the original 
loss function of standard NMF for effective expression of patterns with basis 
vectors in the pattern-expression NMF. Nonsmooth NMF finds localized, parts- 
based representations of nonnegative multivariate data items [35]. 

Projected gradient methods are highly efficient in solving large-scale convex 
minimization problems subject to linear constraints. The NMF projected gra- 
dient algorithms [6], [49] are quite efficient for solving large-scale minimization 
problems subject to nonnegativity and sparsity constraints. 

There are strong ties between NMF and a family of probabilistic latent variable 
models used for analysis of nonnegative data [38]. The latent variable decomposi- 
tions are numerically identical to the NMF algorithm that optimizes a Kullback- 
Leibler metric. In [3], NMF with a Kullback-Leibler error measure is described 
in a statistical framework, with a hierarchical generative model consisting of 
an observation and a prior component. Omitting the prior leads to standard 
KL-NMF algorithms. 


Multiplicative update algorithm and alternating nonnegative least squares 
The NMF problem can be formulated as 

$$\min_{A, S} \|X - AS\|_F, \qquad (13.2)$$

subject to the constraints that all elements of $A$ and $S$ are nonnegative. 
The multiplicative update rule for NMF is given by [30]: 

$$S \leftarrow S \otimes \frac{A^T X}{A^T A S}, \qquad (13.3)$$

$$A \leftarrow A \otimes \frac{X S^T}{A S S^T}, \qquad (13.4)$$

where $\otimes$ and / denote elementwise multiplication and division, respectively. The 
matrices $A$ and $S$ are initialized with positive random values. These equations 
iterate, guaranteeing monotonic convergence to a local maximum of the 
objective function [30]: 

$$F = \sum_{i=1}^{m} \sum_{\mu=1}^{n} \left(X_{i\mu} \ln (AS)_{i\mu} - (AS)_{i\mu}\right). \qquad (13.5)$$


After learning the NMF basis vectors A, new data in matrix X’ are mapped to 
k-dimensional space by fixing A and then randomly initializing S and iterating 
until convergence; or by fixing A and then solving an LS problem X’ = AS’ for 
S’ using pseudoinverse. The LS solution can produce negative entries of S’. One 
can enforce nonnegativity by setting negative values to zero or by using 
nonnegative LS. Setting negative values to zero is computationally much simpler 
than solving LS with nonnegativity constraints, but some information is lost 
after zeroing. 
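
The multiplicative update rule with the Euclidean objective can be sketched in a few lines. This is a minimal illustration, not the NMFPACK implementation used in the example of this section: the function name `nmf_multiplicative`, the small constant `eps` added to the denominators to avoid division by zero, the iteration count and the synthetic data are assumptions. The second part of the sketch maps new data with the basis fixed, using the pseudoinverse followed by zeroing of negative entries.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-12, seed=0):
    """Factorize nonnegative X (m x n) as A (m x k) times S (k x n) by multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = rng.random((m, k)) + 0.1              # positive random initialization
    S = rng.random((k, n)) + 0.1
    errors = []
    for _ in range(n_iter):
        S *= (A.T @ X) / (A.T @ A @ S + eps)  # elementwise update of the sources
        A *= (X @ S.T) / (A @ S @ S.T + eps)  # elementwise update of the basis
        errors.append(np.linalg.norm(X - A @ S))
    return A, S, errors

rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 30))  # nonnegative data of rank 3
A, S, errors = nmf_multiplicative(X, k=3)

# Mapping new data with A fixed: LS solution by pseudoinverse, then zeroing negatives.
X_new = rng.random((20, 3)) @ rng.random((3, 5))
S_new = np.clip(np.linalg.pinv(A) @ X_new, 0.0, None)
```

Because every factor in the updates is nonnegative, the iterates stay nonnegative without any explicit projection, and the recorded Frobenius errors do not increase from one iteration to the next.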
Figure 13.2 NMF applied to the ORL face image database. (a) The basis images for matrix A. (b) 
The evolution of the objective function. 




















The multiplicative update algorithm may fail to converge to a stationary point 
[32]. In a modified strategy [32], if a whole column of A is zero, then it, as well 
as the corresponding row in S, is left unchanged. The convergence of this strategy 
is proved, and the strategy ensures that the modified sequence is bounded. 


Example 13.1: By using the software package NMFPACK 
(http://www.cs.helsinki.fi/patrik.hoyer/ [22]), we implement the multiplicative update 
algorithm with the Euclidean objective for a parts-based representation of the 
ORL face image database. For the ORL database, m = 92, k = 25 and n = 400. 
The basis images derived from the ORL face image database, as well as the 
evolution of the objective function throughout the optimization, are shown in 
Fig. 13.2. The multiplicative update algorithm is seen to converge very slowly. 


Alternating nonnegative least squares is a block coordinate descent in 
bound-constrained optimization: 

$$\min_{A, S} F(A, S) = \frac{1}{2} \|X - AS\|_F^2, \qquad (13.6)$$

subject to the constraints that all elements of $A$ and $S$ are nonnegative. 
The algorithm can be implemented as two alternating convex optimization 
problems 

$$A_{k+1} = \arg\min_{A \ge 0} F(A, S_k), \quad S_{k+1} = \arg\min_{S \ge 0} F(A_{k+1}, S). \qquad (13.7)$$

The method works well in practice and converges quickly. 
A modified strategy proposed in [32] is applied to ensure that the sequence 
generated has at least one limit point, and this limit point is a stationary point 
of NMF [29]. 
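
The alternating scheme can be sketched as below. As an assumption made for
brevity, each nonnegative LS subproblem is solved only approximately by a fixed
number of projected gradient steps rather than by an exact nonnegative LS
solver; the function names nnls_pg and anls are hypothetical.

```python
import numpy as np

def nnls_pg(B, Y, Z, n_inner=50):
    """Approximately solve min_{Z >= 0} 0.5*||Y - B Z||_F^2 by projected gradient."""
    BtB, BtY = B.T @ B, B.T @ Y
    L = np.linalg.norm(BtB, 2) + 1e-12     # Lipschitz constant of the gradient
    for _ in range(n_inner):
        Z = np.maximum(Z - (BtB @ Z - BtY) / L, 0.0)   # gradient step, then projection
    return Z

def anls(X, k, n_outer=30, seed=0):
    """Alternate between the A-subproblem and the S-subproblem."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A, S = rng.random((m, k)), rng.random((k, n))
    history = [0.5 * np.linalg.norm(X - A @ S) ** 2]
    for _ in range(n_outer):
        A = nnls_pg(S.T, X.T, A.T).T       # A-step, written as the transposed problem
        S = nnls_pg(A, X, S)               # S-step with the new A
        history.append(0.5 * np.linalg.norm(X - A @ S) ** 2)
    return A, S, history

X = np.random.default_rng(1).random((20, 30))   # hypothetical nonnegative data
A, S, history = anls(X, k=4)
print("objective at start and end:", history[0], history[-1])
```

Each subproblem is convex, and a projected gradient step of size 1/L does not
increase the objective, so the recorded objective values are nonincreasing.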

Since gradient algorithms for standard NMF learning converge slowly on 
large-scale problems, the projected gradient method [31], the projected 
alternating LS method [1], and the projected Newton method [50] have been 
applied to NMF to speed up its convergence. NMF may not yield a unique 
decomposition. Some modifications of NMF make the decomposition unique by 
imposing a sparseness constraint on the mixing matrix, the source matrix, or 
both [22]. Circumstances under which NMF yields a unique decomposition are 
given in [26]. 

An improved algorithmic framework for the LS NMF problem [24] overcomes 
many deficiencies of gradient descent-based methods including the multiplica- 
tive update algorithm and the alternating LS heuristic. This framework readily 
admits powerful optimization techniques such as the Newton, BFGS and CG 
methods, and includes regularization and box-constraints, thus overcoming defi- 
ciencies without sacrificing convergence guarantees. 

Since A and/or S are usually sparse matrices, a hybrid approach called the 
gradient projection CG algorithm is adapted for NMF. The α-divergence is used 
to unify many well-known cost functions. In a projected quasi-Newton method 
[50], a regularized Hessian with the LM approach is inverted with the Q-less 
QR decomposition. The method uses the quasi-Newton iterates for updating A 
and the fixed-point regularized LS algorithm for computing S. The best result is 
obtained with the quasi-Newton fixed-point algorithm. The gradient projection 
CG gives slightly worse results. The algorithms are implemented in the MATLAB 
toolbox: NMFLAB for Signal and Image Processing [4]. 

Without the explicit assumption of independence, NMF for BSS discussed in 
[5] estimates the original sources from the mixtures. Using the invariant set 
method, the convergence area for the NMF-based BSS algorithm is 
obtained [45]. NMF can also be implemented incrementally for data streams; two 
examples are the online NMF algorithm proposed in [19] and the incremental 
orthogonal projective NMF algorithm proposed in [40]. 


Other NMF methods 


Sparse NMF can be implemented by adding an L1-norm constraint or penalty on 
the factor matrices to the NMF cost function, or by constraining the L0-norm 
of either of the factor matrices [36]. NMF with a minimum-volume constraint [53] 
can improve the sparseness of the results of NMF. This sparseness is L0-norm 
oriented and can give desirable results even in very weak sparseness situations. 
The model based on quadratic programming is quite efficient for small-scale 
problems, and the model using multiplicative updates incorporating the natural 
gradient is more suitable for large-scale problems. 

For NMF, the learned basis is theoretically proved to be not necessarily 
parts-based [12]. By introducing manifold regularization and margin 
maximization to NMF, the manifold regularized discriminative NMF can produce 
a parts-based basis by using a Newton-based fast gradient descent algorithm. 

Projective NMF approximates a data matrix by its nonnegative subspace 
projection [43], [44] 

$$X \approx P X, \tag{13.8}$$

where P is a positive low-rank matrix. The objective function can be measured 
by the Frobenius matrix norm or a modified Kullback-Leibler divergence. Both 
measures are minimized by multiplicative update rules, whose convergence is 
proven in [44]. A nonnegative multiplicative version of Oja’s learning rule can 
be used for computing projective NMF [483]. 

Compared with NMF, $X \approx AS$, projective NMF replaces S with $A^T X$. This 
brings projective NMF close to nonnegative PCA. The term projective refers to 
the fact that $AA^T$ is indeed a projection matrix if A is an orthogonal matrix: 
$A^T A = I$. In projective NMF learning, A becomes approximately orthogonal. 
This has positive consequences in sparseness of the approximation, orthogonal- 
ity of the factorizing matrix, decreased computational complexity in learning, 
close equivalence to clustering, generalization of the approximation to new data 
without heavy recomputations, and easy extension to a nonlinear kernel method 
with wide applications for optimization problems. 
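
The projection property can be verified in one line. The short derivation below
assumes only that A has orthonormal columns, $A^T A = I$, and writes
$P = AA^T$ for the matrix in (13.8):

$$P^2 = A (A^T A) A^T = A A^T = P, \qquad P^T = (A A^T)^T = A A^T = P.$$

Hence P is idempotent and symmetric, that is, an orthogonal projection onto the
column space of A, and $PX = A (A^T X)$ has the NMF form $X \approx AS$ with
$S = A^T X$.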

Object representation in the inferior temporal cortex, an area of visual cortex 
critical for object recognition in the primate, exhibits two prominent properties 
[21]: objects are represented by the combined activity of columnar clusters of neu- 
rons, with each cluster representing component features or parts of objects, and 
closely related features are continuously represented along the tangential direc- 
tion of individual columnar clusters. Topographic NMF [21] is a learning model 
that reflects these properties of parts-based representation and topographic orga- 
nization in a unified framework. 

Topographic NMF [21] incorporates neighborhood connections between NMF 
basis functions arranged on a topographic map. With this extension, the 
nonnegative constraint leads to an overlapping of basis functions along 
neighboring structures. Nonnegativity of NMF has been related to network 
properties such as firing rate representation and 
signed synaptic weight [30]. Topographic NMF represents an input by multiple 
activity peaks to describe diverse information, whereas conventional topographic 
models such as SOM represent an input by a single activity peak in a topographic 
map. Topographic NMF reconstructs the neuronal responses better than SOM. 

A topology-preserving NMF method [52] is derived from the original NMF 
algorithm by preserving the local topology structure. It is based on minimizing 
the constraint gradient distance in high-dimensional space. Compared with the 
L2 distance, the gradient distance is able to reveal the latent manifold 
structure of face patterns. The topology-preserving NMF finds an embedding that 
preserves local topology information, such as edges and texture. 

When the data are highly nonlinearly distributed, it is desirable to 
kernelize NMF. Projected gradient kernel NMF [47] is a nonlinear nonnegative 
component analysis method using kernels. Arbitrary positive definite kernels can 
be adopted. Use of projected gradient procedure [31] guarantees that the limit 
point of the algorithm is a stationary point of the optimization procedure. The 
proposed method leads to better classification rates when compared with kernel 
PCA and kernel ICA [47]. 

Discriminant NMF [46] is a supervised NMF approach to enhancing the clas- 
sification accuracy by introducing Fisher’s discriminative information to NMF. 
Semi-supervised NMF [27] is formulated as a joint factorization of the data 
matrix and the label matrix, sharing a common factor matrix S for consis- 
tency. Constrained NMF is a semi-supervised matrix decomposition method, 
which incorporates the label information as additional constraints [33]. Combin- 
ing label information improves the discriminating power of the resulting matrix 
decomposition. 

Semi-NMF is defined by X = AS, where the elements of S are nonnegative 
but X and A are not constrained [11]. Convex-NMF is obtained when the basis 
vectors of A are convex combinations of the data points [11]. This is used for a 
kernel extension of NMF. Convex-NMF applies to both nonnegative and mixed- 
sign data matrices, and both factor matrices tend to be very sparse. 

Symmetric NMF is a special case of NMF, $X \approx PP^T$, in which P is a 
nonnegative factor and X is completely positive. Weighted symmetric NMF or 
symmetric nonnegative tri-factorization is defined by $X = PQP^T$, where Q is a 
symmetric 
nonnegative matrix. By minimizing the Euclidean distance, parallel multiplica- 
tive update algorithms are proposed in [20], with proved convergence under mild 
conditions. These algorithms are applied to probabilistic clustering. Quadratic 
NMF [48] is a class of approximative NMF methods, where some factorizing 
matrices occur twice in the approximation. 

Another matrix factorization method is latent semantic indexing [8]. The CX 
algorithm is a column-based matrix decomposition algorithm, while the CUR 
algorithm is a column-row-based one [14]. In the CX algorithm [13], [14], a 
matrix A composed of sample vectors is decomposed into two matrices C and 
X. For term-document data and binary image data, the columns of A are sparse 
and nonnegative. The prototype-preserving property of the CX algorithm makes 
the columns of C sparse and nonnegative too. The CX algorithm selects a small 
number of columns by randomly sampling columns of the data matrix according 
to a constructed nonuniform probability distribution derived from SVD. A 
deterministic version of the CX algorithm [28] selects columns in a 
deterministic manner, which well approximates SVD. Each selected column is 
related to an eigenvector of PCA or incremental PCA. 

Probabilistic latent semantic analysis solves the problem of NMF with KL 
divergence [16]. NMF and probabilistic latent semantic analysis are both 
instances of multinomial PCA [2]. 


NMF methods for clustering 


Under certain assumptions, NMF is equivalent to clustering [22, 10]. The matrix 
A is considered to be a centroid matrix as every column represents a cluster 
center, while S is the cluster membership matrix. Orthogonal NMF is NMF 
with orthogonality constraint on either the factor A or S [10]. Orthogonal NMF 
is equivalent to C-means clustering, and orthogonal NMF based on the factor 
A or S is identical to clustering the rows or columns of an input data matrix, 
respectively [41]. Cluster-NMF [11] is an idea similar to projective NMF; it is 
based on Frobenius norm and is a particular case of convex-NMF. Cluster-NMF 
is close to C-means clustering. 

Concept factorization is an extension of NMF for data clustering [42]. It models 
each cluster (concept) $r_k$, k = 1,...,p, as a nonnegative linear combination of 
the data points $x_i$, i = 1,...,n, and each data point $x_i$ as a nonnegative 
linear combination of all the cluster centers (concepts) $r_k$. Data clustering 
is then accomplished by computing the two sets of linear coefficients, which is 
carried out by finding the nonnegative solution that minimizes the 
reconstruction error of the data points. Concept factorization can be performed 
in either the original space or RKHS. It essentially tries to find the 
approximation $X \approx XWV^T$, with elements of W and V being nonnegative. 
Similar to NMF, it aims to minimize 
$E_{CF} = \frac{1}{2} \| X - XWV^T \|_F^2$. When W and V are fixed alternately, 
multiplicative updating rules are as given in [42]. The superiority of concept 
factorization over NMF is shown for document clustering in [42]. 


13.1 Characterize the following two cases as NMF with sparseness constraints 
[22]. 

(a) A doctor analyzing disease patterns might assume that most diseases are 
rare (hence sparse) but that each disease can cause a large number of symptoms. 
Assuming that symptoms make up the rows of her matrix and the columns 
denote different individuals, in this case it is the coefficients which should be 
sparse and the basis vectors unconstrained. 

(b) When trying to learn useful features from a database of images, it might 
make sense to require both A and S to be sparse, signifying that any given 
object is present in few images and affects only a small part of the image. 


13.2 Sparseness measures quantify how much of the energy of a vector is 
contained in only a few of its components. One of the sparseness measures is 
defined based on the relationship between the L1-norm and the L2-norm [22]: 

$$\mathrm{sparseness}(x) = \frac{\sqrt{n} - \left( \sum_i |x_i| \right) / \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1},$$

where n is the dimensionality of x. Explain why this definition is reasonable. 
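
As a starting point for this exercise, the measure can be evaluated at its two
extremes. The sketch below is a direct transcription of the formula under the
assumptions that x is nonzero and n > 1; the function name sparseness is chosen
here for illustration.

```python
import math

def sparseness(x):
    """Sparseness from the ratio of the L1-norm to the L2-norm; assumes x != 0 and len(x) > 1."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

print(sparseness([0, 0, 3, 0]))   # a single active component
print(sparseness([1, 1, 1, 1]))   # energy spread evenly over all components
print(sparseness([1, 2, 3, 4]))   # an intermediate case
```

The first vector attains the maximum value of the measure and the second the
minimum, since the ratio of the two norms equals 1 in the first case and the
square root of n in the second.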


13.3 Give features of the ORL face image database using nmfpack 
(http://www.cs.helsinki.fi/u/phoyer/software.html). 


Independent component analysis 


Introduction 


Imagine that you are attending a cocktail party, the surroundings are full of 
chatter and noise, and somebody is talking about you. In this case, your ears 
are particularly sensitive to this speaker. This is the cocktail-party problem, 
which can be solved by blind source separation (BSS). 

BSS is a very active and commercially driven topic, and has found wide appli- 
cations in a variety of areas including mobile communications, signal processing, 
separating audio signals, demixing multispectral images, biomedical systems, and 
seismic signal processing. In medical systems, BSS is used to identify artifacts 
and signals of interest from the analysis of functional brain imaging signals, such 
as electrical recordings of brain activity as given by magnetoencephalography 
(MEG) or electroencephalography (EEG), and functional magnetic resonance 
imaging (fMRI). BSS has been applied to extract the fetal electrocardiogram 
(FECG) from the electrocardiography (ECG) recordings measured on the 
mother’s skin. 

ICA [32], as a generalization of PCA, is a statistical model. The goal of ICA is 
to recover the latent components from observations. ICA finds a linear represen- 
tation of non-Gaussian data so that the components are statistically independent, 
or as independent as possible. ICA has now been widely used for BSS, feature 
extraction, and signal detection. 

For BSS applications, the ICA model is required to have model identifiability 
and separability [32]. ICA corresponds to a class of methods with the objective of 
recovering underlying latent factors present in the data. The observed variables 
are linear mixtures of the components which are assumed to be mutually indepen- 
dent. Instead of obtaining uncorrelated components as in PCA, ICA attempts to 
linearly transform the original inputs into features that are statistically mutually 
independent. Independence is a stronger condition than uncorrelatedness, and is 
equivalent to uncorrelatedness only in the case of Gaussian distributions. The 
first neural network model with a heuristic learning algorithm, which is related 
to ICA, was developed for online BSS of linearly mixed signals in [58]. 

BSS is not identical to ICA since methods using second-order statistics can be 
used for BSS. These second-order statistics approaches are not restricted by the 
Gaussianity of the sources, but rather require that the sources, although 
independent of one another, are colored in the temporal domain. BSS methods are 
typically either decorrelation-based or ICA-based. Decorrelation methods minimize 
the squared cross-correlation between all pairs of source estimates at two or more 
lags. They are useful for BSS when the sources possess sufficient spectral diver- 
sity even if all the sources are Gaussian distributed. Conversely, ICA methods 
minimize the statistical dependence of the source estimates at lag 0. A spatio- 
temporal BSS method would be appropriate for either of the two aforementioned 
cases since it combines ICA and decorrelation criteria into one algorithm. 

Most algorithms for ICA, directly or indirectly, minimize the mutual infor- 
mation between the component estimates, which corresponds to maximization 
of the negentropy, a measure of non-Gaussianity of the components [53]. The 
exact maximization of the negentropy is difficult and computationally demanding 
because a correct estimation of the source densities is required. Most of the exist- 
ing ICA algorithms can be viewed as approximating negentropy through simple 
measures, such as high-order cumulants [46], [53]. Most of the ICA algorithms 
based on unsupervised learning use a Hebb-type rule or its generalization 
that adopts nonlinear functions. 


ICA model 


Let a $J_1$-vector x denote a linear mixture and a $J_2$-vector s, whose 
components have zero mean and are statistically mutually independent, denote 
the original source signals. The ICA model can be defined by 

$$x = A s + n, \tag{14.1}$$

where A is a constant full-rank $J_1 \times J_2$ mixing matrix whose elements are 
the unknown coefficients of the mixtures, and n denotes an additive noise term, 
which is often omitted since it is usually impossible to separate noise from the 
sources. ICA takes one of three forms, namely, square ICA for $J_1 = J_2$, 
overcomplete ICA for $J_1 < J_2$, and undercomplete ICA for $J_1 > J_2$. While 
undercomplete ICA is useful for feature extraction, overcomplete ICA may be 
applied to signal and image processing methods based on multiscale and 
redundant basis sets. 

The goal of ICA is to estimate s by 

$$y = W x \tag{14.2}$$

such that the components of y, which is the estimate of s, are statistically as 
independent as possible. W is a $J_2 \times J_1$ demixing matrix. In the ICA 
model, two ambiguities hold: one cannot determine the variances (energies) of 
the independent components, and one cannot determine their order. ICA can be 
considered a variant of projection pursuit. 
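
Both ambiguities follow directly from the noise-free model. As a short worked
identity, let D be any invertible diagonal matrix and P any permutation matrix
(both symbols are introduced here for illustration only). Since $P^T P = I$,

$$x = A s = (A D^{-1} P^T)(P D s).$$

The rescaled and reordered vector $P D s$ still has mutually independent
components, so the pair $(A D^{-1} P^T, P D s)$ explains the observations
exactly as well as $(A, s)$. Nothing in x alone can distinguish the two.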

The statistical independence property implies that the joint probability density 
of the components of s equals the product of the marginal densities of the 
individual components. Each component of s is a stationary stochastic process and 
only one of the components is allowed to be Gaussian distributed. The 
higher-order statistics of the original inputs are required for estimating s, 
rather than the second-order moment or covariance of the samples as used in PCA. 
Notice that the MSE for any estimate of a nonrandom parameter has a lower bound, 
called the Cramer-Rao bound. This lower bound defines the ultimate accuracy of 
any estimator, and is closely related to the ML estimator. Consider estimating a 
vector of parameters $\theta$ from a data vector x that has a probability 
density, by using some unbiased estimator $\hat{\theta}$. The Cramer-Rao bound, 
which is the lower bound for the variance of $\hat{\theta}$, is derived in [67] 
for estimating the source signals in ICA, based on the assumption that all 
independent components have finite variance. 

Figure 14.1 Illustration of PCA and ICA for a two-dimensional non-Gaussian data set. 

PCA and ICA differ in two respects. The components of the signal extracted by 
ICA are statistically independent, not merely uncorrelated as in PCA. The 
demixing matrix W of ICA is not orthogonal, while in PCA the components of the 
weights are represented on an orthonormal basis. ICA provides in many cases a 
more meaningful representation of the data than PCA does. ICA can be realized 
by adding nonlinearity to linear PCA networks such that they are able to improve 
the independence of their outputs. In [61], an efficient ICA algorithm is 
derived by minimizing a nonlinear PCA criterion using the RLS approach. A 
conceptual comparison of PCA and ICA is illustrated in Fig. 14.1, where 
$w_i^{PCA}$ and $w_i^{ICA}$, i = 1, 2, are the ith principal and ith independent 
directions, respectively. 


Approaches to ICA 


A well-known two-phase approach to ICA is to preprocess the data by PCA, 
and then to estimate the necessary rotation matrix. A generic approach to ICA 
consists of preprocessing the data, defining measures of non-Gaussianity, and 
optimizing an objective function, known as a contrast function. Some measures 
of non-Gaussianity are kurtosis, differential entropy, negentropy, and mutual 
information, which can be derived from one another. For example, one approach 
is to minimize the mutual information between the components of the output 
vector 

$$I(y) = \sum_{i=1}^{J_2} H(y_i) - H(y), \tag{14.3}$$

where $H(y) = -\int p(y) \ln p(y) \, dy$ is the joint entropy, and 
$H(y_i) = -\int p_i(y_i) \ln p_i(y_i) \, dy_i$ is the marginal entropy of 
component i, p(y) being the joint pdf of all the elements of y and $p_i(y_i)$ 
the marginal pdf of $y_i$. The mutual information $I \ge 0$ and is zero only 
when the components are mutually independent. 

The classical measure of non-Gaussianity is kurtosis. Kurtosis is the degree of 
peakedness of a distribution, based on the fourth central moment of the 
distribution. The kurtosis of $y_i$ is classically defined by 

$$\mathrm{kurt}(y_i) = E[y_i^4] - 3 \left( E[y_i^2] \right)^2. \tag{14.4}$$

If $\mathrm{kurt}(y_i) < 0$, $y_i(t)$ is a sub-Gaussian source, while for 
super-Gaussian sources $\mathrm{kurt}(y_i) > 0$. For a Gaussian $y_i$, the fourth 
moment equals $3 (E[y_i^2])^2$, and thus 
the kurtosis of Gaussian sources is zero. Super-Gaussian random variables have 
typically a spiky pdf with heavy tails, i.e. the pdf is relatively large at zero 
and at large values of the variable, while being small for intermediate values. 
A typical example is the Laplace distribution. Sub-Gaussian random variables 
have typically a flat pdf, which is rather constant near zero, and very small 
for larger values of the variable. Typically non-Gaussianity is measured by the 
absolute value of kurtosis. However, kurtosis has to be estimated from a measured 
sample, and it can be very sensitive to outliers. 
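
The sign convention of (14.4) can be checked numerically on three familiar
distributions. The sketch below uses only the Python standard library; the
sample size, the seed and the choice of distributions are assumptions made for
illustration, and the samples are centered first because (14.4) presumes zero
mean.

```python
import random

def kurt(samples):
    """Sample version of kurt(y) = E[y^4] - 3 (E[y^2])^2 after removing the mean."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [v - mean for v in samples]
    m2 = sum(v ** 2 for v in centered) / n
    m4 = sum(v ** 4 for v in centered) / n
    return m4 - 3.0 * m2 ** 2

rng = random.Random(0)
N = 50000
uniform = [rng.uniform(-1.0, 1.0) for _ in range(N)]                  # flat pdf
gauss = [rng.gauss(0.0, 1.0) for _ in range(N)]                       # Gaussian
laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(N)]  # spiky, heavy tails

k_uniform, k_gauss, k_laplace = kurt(uniform), kurt(gauss), kurt(laplace)
print("uniform:", k_uniform)
print("Gaussian:", k_gauss)
print("Laplace:", k_laplace)
```

The uniform sample gives a negative value (sub-Gaussian), the Laplace sample a
positive value (super-Gaussian), and the Gaussian sample a value near zero.
Appending a single large outlier to any of the lists changes the estimate
noticeably, which illustrates the sensitivity mentioned above.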

Negentropy is a measure of non-Gaussianity that is zero for a Gaussian variable 
and always nonnegative: 


$$J(y) = H(y_{\mathrm{gauss}}) - H(y), \tag{14.5}$$


where $y_{\mathrm{gauss}}$ is a Gaussian random variable with the same covariance matrix as 
y. Negentropy is invariant for invertible linear transformations [32]. Entropy is 
small for distributions that are clearly concentrated on certain values, i.e., when 
the variable is clearly clustered or has a pdf that is very spiky. In fact, negentropy 
is in some sense the optimal estimator of non-Gaussianity, as far as statistical 
properties are concerned. Computation of negentropy is very difficult. Therefore, 
simpler approximations of negentropy are very useful. 

Two common approaches in ICA algorithms are minimum mutual information 
and maximum output entropy approaches. A natural measure of independence is 
mutual information [32], a nonnegative scalar that equals zero when the signals 
are independent. Mutual information estimation of continuous signals is notori- 
ously difficult but when prewhitening of the observed signal space is performed, 
mutual information minimization becomes equivalent to finding the orthogonal 
directions for which the negentropy is maximized [32]; thus much research has 
focused on developing one-dimensional approximations of negentropy and 
differential entropy. Minimization of output mutual information is the canonical 
contrast for BSS [19]. The three best-known methods for ICA, namely JADE [19], 
infomax [10], and FastICA [45], [49], use diagonalization of cumulant matrices, 
maximization of output entropy, and fourth-order cumulants, respectively. Two 
other popular ICA algorithms are the natural-gradient [3] and the equivariant 
adaptive separation via independence (EASI) [21]. These methods can be eas- 
ily extended to the complex domain by using Hermitian transpose and complex 
nonlinear functions. 

In the context of BSS, higher-order statistics are necessary only for temporally 
uncorrelated stationary sources. Second-order statistics-based source separation 
exploits temporally correlated stationary sources [29] and the nonstationarity 
of the sources [78, 29]. Many natural signals are inherently nonstationary with 
time-varying variances, since the source signals incorporate time delays into the 
basic BSS model. 


The overlearning problem 

When too few samples are available, most ICA algorithms produce very similar 
types of overlearning [50]. The overlearned source estimates have a single 
spike or bump and are practically zero everywhere else, regardless of the 
observations x. The overlearning problem in ICA algorithms based on marginal 
distribution information is discussed in [94]. This is similar to the classical 
overlearning in linear regression. The solutions to this spike problem include 
the acquisition of more samples, or the reduction of dimensions. The reduction 
of dimensions is a more efficient way to avoid the problem, 
provided that there are more sensors than sources. 

The overlearning problem can be solved neither by acquiring more samples nor by 
dimension reduction when the data has strong time dependencies, such as a 1/f 
power spectrum. This overlearning is better characterized by bumps instead of 
spikes. This spectrum characteristic is typical of MEG as well as many other 
natural data. Due to its 1/f nature, MEG data tends to show a bump-like 
overlearning, much like the one in the random walk data set, rather than the 
spike type observed for the Gaussian i.i.d. data set. Asymptotically, the kurtoses 
of the spikes and bumps tend to zero, when the number of samples increases. 


Popular ICA algorithms 


Infomax ICA 


The infomax approach [10] aims to maximize the mutual information between 
the observations and the nonlinearly transformed outputs of a set of linear filters. 
It is a gradient-based technique implementing entropy maximization in a single- 
layer feedforward network. 

A certain transformation is applied to the inputs, 

$$y_i = f_i(w_i^T x), \quad i = 1, \ldots, J_2, \tag{14.6}$$

where $f_i(\cdot)$ is a monotone squashing function such as a sigmoid. Maximizing 
the mutual information between outputs y and inputs x, which is equivalent 
to maximizing the entropy of y due to the deterministic relation (14.6), would 
lead to independent components. This effect can be understood through the 
decomposition 

$$H(y) = \sum_{i=1}^{J_2} H_i(y_i) - I(y_1, \ldots, y_{J_2}), \tag{14.7}$$

where H(y) is the joint entropy of y, $H_i(y_i)$ is the individual entropy, and 
I is the mutual information among the $y_i$’s. Maximizing the joint entropy thus 
involves maximizing the individual entropies of the $y_i$’s and minimizing the 
mutual information between the $y_i$’s. 

For square representations the infomax approach turns out to be equivalent to 
the causal generative one if we interpret f;(-) to be the cumulative distribution 
function of p;(-) [22]. 

The natural gradient solution is obtained as [10]: 

$$W(t+1) = W(t) + \eta \left[ I - 2 g(y) y^T \right] W(t), \tag{14.8}$$

where $\eta$ is the learning rate, and g(y) = tanh(y) for source signals with 
positive kurtosis. 
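
The update (14.8) can be tried on a toy mixture. The sketch below is a batch
variant in which the term $g(y) y^T$ is replaced by its average over all
samples; the mixing matrix, sample size, step size and the cross-talk measure
are assumptions made for illustration, and Laplace sources are used because
they have positive kurtosis.

```python
import numpy as np

def infomax_natural_gradient(X, eta=0.1, n_iter=1000):
    """Batch form of W <- W + eta [I - 2 tanh(y) y^T] W, averaged over the samples."""
    J, N = X.shape
    W = np.eye(J)
    for _ in range(n_iter):
        Y = W @ X
        W = W + eta * (np.eye(J) - 2.0 * np.tanh(Y) @ Y.T / N) @ W
    return W

def crosstalk(G):
    """Largest ratio of off-peak to peak magnitude in any row of the global matrix G = W A."""
    G = np.abs(G)
    peak = G.max(axis=1)
    return float(((G.sum(axis=1) - peak) / peak).max())

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 20000))            # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])      # hypothetical mixing matrix
X = A @ S
W = infomax_natural_gradient(X)
print("cross-talk before separation:", crosstalk(A))
print("cross-talk after separation: ", crosstalk(W @ A))
```

If separation succeeds, W A is close to a scaled permutation matrix, so each
row has one dominant entry and the cross-talk is small. The scale of each
recovered source is fixed by the nonlinearity rather than by the data, which
reflects the variance ambiguity of the ICA model.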

Infomax can be viewed as an ML approach [22] or as a mutual information-based 
approach. It can be interpreted as assuming some given, a priori marginal 
distributions for $y_i$. MISEP [2] is an infomax-based ICA technique for linear 
and nonlinear 
mixtures, but estimates the marginal distributions in a different way, based on a 
maximum entropy criterion. MISEP generalizes infomax in two ways: to deal with 
nonlinear mixtures, and to be able to adapt to the actual statistical distributions 
of the sources, by dynamically estimating the nonlinearities to be used at the 
outputs. MISEP optimizes a network with a specialized architecture, with the 
output entropy as objective function. The numbers of components of s and y 
are assumed to be the same. 

Infomax is better suited to estimation of super-Gaussian sources: sharply 
peaked pdfs with heavy tails. It fails to separate sources that have negative 
kurtosis [10]. An extension of the infomax algorithm [71] is able to blindly 
separate mixed signals with sub- and super-Gaussian source distributions, by 
using a simple type of learning rule derived by choosing negentropy as a 
projection pursuit index. Parameterized probability distributions with sub- and 
super-Gaussian regimes are used to derive a general learning rule. This general 
learning rule preserves the simple architecture proposed in [10], is optimized 
using the natural gradient [5], and uses the stability analysis given in [21] 
to switch between sub- and super-Gaussian regimes. 

The BSS algorithm in [90] is a tradeoff between gradient infomax and natural 
gradient infomax. Desired equilibrium points are locally stable by choosing 
appropriate score functions and step sizes. The algorithm provides better per- 
formance than the gradient algorithm, and it is free from approximation error 
and the small-step-size restriction of the natural gradient algorithm. The proof 
of local stability of the desired equilibrium points for BSS is given by using 
monotonically increasing and odd score functions. 


EASI, JADE, and natural-gradient ICA 

The EASI rule is given by [21] 

$$W(t+1) = W(t) + \eta \left[ y g(y^T) - g(y) y^T \right] W(t), \tag{14.9}$$

where $\eta$ is the learning rate, and the nonlinearity is usually a simple 
cubic polynomial, $g(y) = y^3$. In [93], the optimal choice of these 
nonlinearities is addressed. 
This optimal nonlinearity is the output score function difference. It is a multi- 
variate function which depends on the output distributions. The resulting quasi- 
optimal EASI can achieve better performance than standard EASI, but requires 
an accurate estimation of score function difference. However, the method has the 
great advantage of converging for any source, contrary to standard EASI whose 
convergence assumes a condition on the source statistics [21]. 

JADE [19] is an exact algebraic approach to perform ICA. It is based on 
joint diagonalisation of the fourth-order cumulant tensors of prewhitened input 
data. The bottleneck with JADE when dealing with high-dimensional problems 
is the algebraic determination of the mixing matrices. JADE is based on the 
estimation of kurtosis via cumulants. A neural implementation of JADE [115] 
adaptively determines the mixing matrices to be jointly diagonalized with JADE. 
The learning rule uses higher-order neurons and generalizes Oja’s PCA rule. 

Natural gradient learning for $J_1 = J_2$ [3], [5] is the true steepest-descent 
method in the Riemannian parametric space of the nonsingular matrices. It is 
proved to be Fisher-efficient in general, having the equivariant property. Natural 
gradient learning is extended to the overcomplete and undercomplete cases in 
[6]. The observed signals are assumed to be whitened by preprocessing, so that 
we can use the natural Riemannian gradient in Stiefel manifolds. The objective 
function is given by 

$$E = -\ln |\det W| - \sum_{i=1}^{n} \log p_i(y_i(t)), \tag{14.10}$$

where $y = W x(t)$ and $p_i(\cdot)$ represents the hypothesized pdf for the 
latent variable $s_i(t)$ (or its estimate $y_i(t)$). 

The natural gradient ICA algorithm, which iteratively finds a minimum of 
(14.10), has the form [3] 

$$W(t+1) = W(t) + \eta \left[ I - g(y(t)) y^T(t) \right] W(t), \tag{14.11}$$

where $\eta > 0$ and $g(y) = (g_1(y_1), \ldots, g_n(y_n))^T$, each element of 
which corresponds to the negative score function, i.e., 
$g_i(y_i) = -d \log p_i(y_i) / d y_i$. Function $g(\cdot)$ is given by the 
polynomial 
$g(y) = \frac{29}{4} y^3 - \frac{47}{4} y^5 - \frac{14}{3} y^7 + \frac{25}{4} y^9 + \frac{3}{4} y^{11}$. 

Differential ICA [30] is a variation of natural gradient ICA, where learning 
relies on the concurrent change of output variables. Differential learning is inter- 
preted as the ML estimation of parameters with latent variables represented by 
the random walk model. The differential anti-Hebb rule is a modification of the 
anti-Hebb rule. It updates the synaptic weights in a linear feedback network in 
such a way that the concurrent change of neurons is minimized. 

The relative gradient [21] or the natural gradient [5] is efficient in learning 
the parameters when the parametric space belongs to the Riemannian manifold. 
The relative gradient leads to algorithms having the equivariant property, which 
yields uniform performance regardless of the condition of the mixing 
matrix in the task of BSS or ICA. When the mixing matrix is ill-conditioned, 
the relative gradient ICA algorithms outperform other types of algorithms. The 
relative gradient is a particular instance of the relative optimization [114]. 


FastICA algorithm 


FastICA is a well-known fixed-point ICA algorithm [45, 49]. It is derived from 
the optimization of the kurtosis or the negentropy measure by using Newton’s 
method. FastICA achieves reliable and at least quadratic convergence. It can be 
considered as a fixed-point algorithm for ML estimation of the ICA model. Fas- 
tICA is parallel, distributed, computationally simple, and requires little memory 
space. 

FastICA estimates multiple independent components one by one using a GSO- 
like deflation scheme. It first prewhitens the observed data to remove any second- 
order correlations, and then performs an orthogonal rotation of the whitened data 
to find the directions of the sources. FastICA is very simple, does not depend on 
any user-defined parameters, and rapidly converges to the most accurate solu- 
tion allowed by the data. The algorithm finds, one at a time, all non-Gaussian 
independent components, regardless of their probability distributions. It is
performed in either batch mode or a semiadaptive manner. The convergence of the
algorithm is rigorously proved, and the convergence speed is shown to be cubic
for the kurtosis-based contrast. 

The original FastICA [45] based on kurtosis nonlinearity is non-robust due 
to the sensitivity of the sample fourth-order moment to outliers. Consequently, 
other nonlinearities are offered as more robust choices [49]. A rigorous statistical
analysis of the deflation-based FastICA estimator is provided in [88]. The derived 
compact closed-form expression of the influence function reveals the vulnerabil- 
ity of the FastICA estimator to outliers regardless of the nonlinearity used. The 
influence function allows the derivation of a compact closed-form expression for 
the asymptotic covariance matrix of the FastICA estimator and subsequently its 
asymptotic relative efficiencies. 
The mixtures $x_t$ are first prewhitened according to Section A.2:
$$v(t) = V^T x_t, \qquad (14.12)$$
where $v(t)$ is the whitened mixture and $V$ denotes a $J_1 \times J_2$ whitening matrix.
The components of $v(t)$ are mutually uncorrelated with unit variances, namely,
$E[v(t)v^T(t)] = I_{J_2}$. 

The demixing matrix W is factorized by 


$$W^T = U^T V^T, \qquad (14.13)$$

where $U = [u_1, \ldots, u_{J_2}]$ is the $J_2 \times J_2$ orthogonal separating matrix, that is,
$U^T U = I_{J_2}$. The vectors $u_i$ can be obtained by iteration: 





$$\hat{u}_i = E\left[v\, g(u_i^T v)\right] - E\left[\dot{g}(u_i^T v)\right] u_i, \quad i = 1, \ldots, J_2, \qquad (14.14)$$
$$u_i = \frac{\hat{u}_i}{\|\hat{u}_i\|}, \quad i = 1, \ldots, J_2, \qquad (14.15)$$

where $g(\cdot)$ can be selected as $g_1(x) = \tanh(ax)$ and $g_2(x) = x e^{-x^2/2}$. The
independent components can be estimated in a hierarchical fashion, that is, estimated
one by one. After the $i$th independent component is estimated, $u_i$ is
orthogonalized by an orthogonalization procedure. 
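
The iteration (14.14)-(14.15) with deflation can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference FastICA code: the tanh nonlinearity with a = 1, Gram-Schmidt deflation, the random initialization, the stopping tolerance, and the helper names `whiten` and `fastica_deflation` are all choices made for this example.

```python
import numpy as np

def whiten(X, n_comp):
    """Return V (J1 x J2) and v = V^T (x - mean) with identity sample covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, E = np.linalg.eigh(np.cov(Xc))
    idx = np.argsort(lam)[::-1][:n_comp]         # keep the largest eigenvalues
    V = E[:, idx] / np.sqrt(lam[idx])
    return V, V.T @ Xc

def fastica_deflation(Vx, n_iter=200, tol=1e-8, a=1.0, seed=0):
    """One-unit fixed-point iteration with g(x) = tanh(a x), one component at a time."""
    rng = np.random.default_rng(seed)
    J2, T = Vx.shape
    U = np.zeros((J2, J2))
    for i in range(J2):
        u = rng.standard_normal(J2)
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            gy = np.tanh(a * (u @ Vx))
            dgy = a * (1.0 - gy ** 2)                        # derivative of tanh(a x)
            u_new = (Vx * gy).mean(axis=1) - dgy.mean() * u  # update (14.14)
            u_new -= U[:, :i] @ (U[:, :i].T @ u_new)         # remove found directions
            u_new /= np.linalg.norm(u_new)                   # normalization (14.15)
            converged = abs(abs(u_new @ u) - 1.0) < tol      # sign flips are allowed
            u = u_new
            if converged:
                break
        U[:, i] = u
    return U

rng = np.random.default_rng(0)
S = np.vstack([rng.uniform(-1, 1, 4000), rng.laplace(size=4000)])
A = np.array([[1.0, 0.5], [0.3, 1.0]])           # hypothetical mixing matrix
V, Vx = whiten(A @ S, n_comp=2)
U = fastica_deflation(Vx)
Y = U.T @ Vx                                      # estimated components y = U^T v
```

The convergence test compares the new and old vectors up to sign, since u and -u define the same component.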

FastICA can also be implemented in a symmetric mode, where all the inde- 
pendent components are extracted and orthogonalized at the same time [49]. A 
similar fixed-point algorithm based on the nonstationary property of signals is 
derived in [52]. FastICA is easy to use and there are no step-size parameters 
to choose, while gradient descent-based algorithms seem to be preferable only if 
fast adaptivity in a changing environment is required. FastICA directly finds the 
independent components of practically any non-Gaussian distribution using any 
nonlinearity g(-) [49]. 
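
In the symmetric mode, the orthogonalization step can be carried out for all vectors at once. A common choice, assumed here for illustration, is the symmetric decorrelation U (U^T U)^{-1/2}, which equals the orthogonal factor of the polar decomposition of U; the function name below is hypothetical.

```python
import numpy as np

def symmetric_orthogonalize(U):
    """Return U (U^T U)^{-1/2}, computed from the eigendecomposition of U^T U."""
    lam, Q = np.linalg.eigh(U.T @ U)         # U^T U = Q diag(lam) Q^T
    return U @ (Q / np.sqrt(lam)) @ Q.T      # Q / sqrt(lam) scales the columns of Q

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))              # stand-in for the updated vectors
U_sym = symmetric_orthogonalize(M)
```

Unlike deflation, no vector is privileged by being orthogonalized first, so estimation errors are not passed from earlier components to later ones.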

For FastICA, Hyvärinen suggested three different functions [49]: 

$$g_4(s) = \log\cosh(s), \qquad (14.16)$$
$$g_5(s) = -\exp(-s^2/2), \qquad (14.17)$$
$$g_6(s) = s^4/4. \qquad (14.18)$$

$g_4$ is a good general-purpose function and $g_5$ is justified if robustness is very
important. For sources of fixed variance, the maximum of $g_6$ coincides with the
maximum of kurtosis. All these contrast functions can be viewed as approximations
of negentropy. Cumulant-based approximations, such as $g_6$, are mainly
sensitive to the tails of the distributions, and thus sensitive to outliers as well
[46]. 
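
The three contrast functions (14.16)-(14.18) are the antiderivatives of the nonlinearities tanh(x) (with a = 1), x exp(-x^2/2) and x^3 used in the fixed-point iteration. The short check below, a sketch with hypothetical function names, confirms this numerically by central differences.

```python
import numpy as np

def G_logcosh(s):                 # (14.16)
    return np.log(np.cosh(s))

def G_gauss(s):                   # (14.17)
    return -np.exp(-s**2 / 2)

def G_kurt(s):                    # (14.18)
    return s**4 / 4

def numerical_derivative(G, s, h=1e-5):
    """Central-difference approximation of dG/ds."""
    return (G(s + h) - G(s - h)) / (2 * h)

s = np.linspace(-3.0, 3.0, 13)
d_logcosh = numerical_derivative(G_logcosh, s)   # to compare with tanh(s)
d_gauss = numerical_derivative(G_gauss, s)       # to compare with s * exp(-s^2 / 2)
d_kurt = numerical_derivative(G_kurt, s)         # to compare with s^3
```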

In [41], a comprehensive experimental comparison has been conducted on dif- 
ferent classes of ICA algorithms including FastICA, infomax, natural-gradient, 
EASI, and an RLS-based nonlinear PCA [60]. The fixed-point FastICA with 
symmetric orthogonalization and tanh nonlinearity $g_1(x)$ is concluded to be the
best tradeoff for ICA, since it provides results similar to those of infomax and
natural-gradient, which are optimal with respect to minimizing the mutual
information, but with a clearly smaller computational load. When $g(x) = g_3(x) = x^3$,
the fixed-point FastICA algorithm achieves cubic convergence; however, the
algorithm is less accurate than when the tanh nonlinearity is used [41]. 

FastICA can approach the Cramér-Rao lower bound in two situations [67],
namely, when the distribution of the sources is nearly Gaussian and the algorithm
is in symmetric mode using the nonlinear function $g_1(x)$, $g_2(x)$ or $g_3(x)$, and when
the distribution of the sources is very different from Gaussian and the nonlinear
function equals the score function of each independent component. A closed-form
expression for the Cramér-Rao bound on estimating the source signals in the
linear ICA problem is derived in [67], assuming that all independent components
have finite variance. An asymptotic performance analysis of FastICA in [101]
derives the exact expression for this error variance. The accuracy of FastICA is
very close, but not equal, to the Cramér-Rao bound. The condition for this is
that the nonlinearity $g(\cdot)$ in the FastICA contrast function is the integral of the
score function $\psi(s)$ of the original signals, or the negative log density

$$g(s) = \int \psi(s)\, ds = -\int \frac{p_i'(s)}{p_i(s)}\, ds = -\log p_i(s). \qquad (14.19)$$
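
As a brief worked instance of (14.19), assume purely for illustration that a source follows the hyperbolic secant density. The negative log density is then the log cosh function up to an additive constant, and its derivative, the score function, is the tanh nonlinearity:

```latex
p_i(s) = \frac{1}{\pi \cosh(s)}
\;\Longrightarrow\;
g(s) = -\log p_i(s) = \log\cosh(s) + \log\pi,
\qquad
\psi(s) = \frac{dg(s)}{ds} = -\frac{p_i'(s)}{p_i(s)} = \tanh(s).
```

For sources of this type, the log cosh contrast with the tanh nonlinearity therefore satisfies the stated condition exactly.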

Efficient FastICA [68] improves FastICA, and it can attain the Cramér-Rao
bound. This result is rigorously proven under the assumption that the probability
distribution of the independent signal components belongs to the class of
generalized Gaussian distributions with parameter $\alpha$, denoted GG($\alpha$), for $\alpha > 2$. 
The computational complexity of the algorithm is about three times that of symmetric FastICA. 

FastICA can be implemented in the deflation (or sequential extraction) and 
the symmetric (or simultaneous extraction) modes. In the deflation mode, the 
constraint of uncorrelatedness with the previously found sources is required to 
prevent the algorithm from converging to previously found components. In the 
symmetric mode, the components are estimated simultaneously. The deflation- 
based FastICA can estimate a single or a subset of the original independent 
components one-by-one, with reduced computational load, but errors can accu- 
mulate in successive deflation stages. Symmetric FastICA recovers all source 
signals simultaneously [45]. It is widely used in practice for BSS, due to its good 
accuracy and convergence speed. This algorithm shows local quadratic conver- 
gence to the correct solution with a generic cost function. For the kurtosis cost 
function, the convergence is cubic. Thus, the one-unit behavior generalizes to the 
parallel case as well. The chosen nonlinearity in the cost function has very little 
effect on the behavior. The score function of the sources, which is optimal for 
ML and minimum entropy criteria for ICA, seems not to offer any advantages 
for the speed of convergence. However, the true score function does minimize the 
residual error in the finite sample case [67]. 
The local convergence properties of FastICA have been investigated under a
general setting in [86]. Unfortunately, the major difficulty of local
convergence analysis is the well-known sign-flipping phenomenon of FastICA,
which causes discontinuity of the corresponding FastICA map on the
unit sphere, and the approach taken is not mathematically rigorous. There have 
been a few attempts to generalize FastICA to solve the problem of independent 
subspace analysis [55]. In [98], by using the mathematical concept of principal 
fiber bundles, FastICA is proven to be locally quadratically convergent to a cor- 
rect separation. Higher-order local convergence properties of FastICA are also 
investigated in the framework of a scalar shift strategy. As a parallelized version 
of FastICA, QR FastICA [98], which employs the GSO process instead of the 
polar decomposition, shares similar local convergence properties with FastICA. 

The Huber M-estimator cost function is introduced as a contrast function for 
use within prewhitened BSS algorithms such as FastICA [35]. Key properties 
regarding the local stability of the algorithm for general non-Gaussian source 
distributions are established, and its separating capabilities are shown through 
analysis to be insensitive to the threshold parameter. The Huber
M-estimator cost can serve as a criterion for successful separation of large-scale
and ill-conditioned signal mixtures with reduced data set requirements. 

A family of flexible score functions for BSS is based on the family of gener- 
alized gamma densities. Flexible FastICA [66] uses FastICA to blindly extract 
the independent source signals, while an efficient ML-based method is used to 
adaptively estimate the parameters of such score functions. A FastICA algorithm 
suitable for the separation of quaternion-valued signals from an observed linear 
mixture is proposed in [57]. 


Example 14.1: Assume that we have five independent signals, as shown in
Fig. 14.2a. If we have more than five mixtures that are obtained by linearly
mixing the five sources, only five independent sources can be obtained by the
ICA procedure. Figure 14.2b illustrates $J_1 = 6$ mixed signals that are obtained
by linearly mixing the five sources. We apply FastICA. After the iteration, $W^T A$
becomes a $5 \times 5$ identity matrix up to the order and scaling of its rows. The
separated sources are shown in Fig. 14.2c. The separated $J_2 = 5$ signals are very
close to the original independent signals, and there are ambiguities in the order
of the separated independent components and in the amplitude of some independent
components. In comparison with the original signals, some separated signals are
multiplied by $-1$. When $J_1 > J_2$, ICA can be used for both BSS and feature extraction. 
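
The reason why only five components can be recovered from six mixtures is visible in the covariance matrix of the mixtures, which has only five non-negligible eigenvalues. The sketch below illustrates this with synthetic signals; the waveforms, the 6 x 5 mixing matrix and the threshold are hypothetical and are not the data of Fig. 14.2.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
t = np.arange(T)
S = np.vstack([
    np.sin(0.02 * t),                 # sinusoid
    np.sign(np.sin(0.05 * t)),        # square wave
    ((0.013 * t) % 1.0) - 0.5,        # sawtooth
    rng.uniform(-1, 1, T),            # uniform noise
    rng.laplace(size=T),              # Laplacian noise
])
# Hypothetical mixing matrix: five pass-through rows plus the sum of all sources.
A = np.vstack([np.eye(5), np.ones((1, 5))])
X = A @ S                                        # six mixtures of five sources

lam = np.linalg.eigvalsh(np.cov(X))              # eigenvalues in ascending order
n_sources = int(np.sum(lam > 1e-8 * lam.max()))  # count non-negligible eigenvalues
print(n_sources)
```

Prewhitening that keeps only these directions reduces the six-dimensional mixture to five dimensions before the rotation stage.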


Figure 14.2 An example of ICA using FastICA. (a) Five independent sources. (b) Mixed sources. (c)
Separated sources. 


Example 14.2: We use the MATLAB package (http://research.ics.aalto.fi/ica/imageica/ [51])
for estimating ICA basis windows from image data. The statistical analysis is
performed on 13 natural grayscale images. The model size is set as 128 and we
apply FastICA to get the ICA model. We implement ICA estimation for 200
iterations. For each input image patch $x$,
$x = As = [a_1^T\ a_2^T\ \cdots\ a_{128}^T](s_1, s_2, \ldots, s_{128})^T$. The basis row vectors
$a_i$, each represented by a basis window of the same size as the image patch, are
localized both in space and in frequency, resembling wavelets. The obtained basis
windows of the model are shown in Fig. 14.3. The obtained basis windows can be
used for image coding. 
Figure 14.4 Architecture of the ICA network. 


14.5 ICA networks 


ICA does not require a nonlinear network for linear mixtures, but its basis
vectors are usually nonorthogonal and the learning algorithm must contain some
nonlinearities for higher-order statistics. A three-layer $J_1$-$J_2$-$J_1$ linear
autoassociative network that is used for PCA can also be used as an ICA network,
as long as the outputs of the hidden layer are independent. For the ICA network, the
weight matrix between the input and hidden layers corresponds to the $J_1 \times J_2$
demixing matrix $W$, and the weight matrix from the hidden to the output layer
corresponds to the $J_2 \times J_1$ matrix $A^T$, where $A$ is the mixing matrix. In [59],
$W$ is further factorized into two parts according to (14.13) and the network
becomes a four-layer $J_1$-$J_2$-$J_2$-$J_1$ network, as shown in Fig. 14.4. The weight
matrices between the layers are, respectively, $V$, $U$ and $A^T$. $y_i$, $i = 1, \ldots, J_2$,
is the $i$th output of the bottleneck layer, and $\hat{x}_j$, $j = 1, \ldots, J_1$, is the
estimate of $x_j$. 

Each of the three weight matrices performs one of the processing tasks required 
for ICA, namely, whitening, separation, and estimation of the basis vectors of 
ICA. It can be used for both BSS and estimation of the basis vectors of ICA, 
which is useful, for example, in projection pursuit. If the task is merely BSS, the 
last ICA basis vector estimation layer is not needed. 

The weights between the input and second layers perform prewhitening
$$v(t) = V^T x_t. \qquad (14.20)$$
The weights between the second and third layers perform separation
$$y(t) = U^T v(t), \qquad (14.21)$$
and the weights between the last two layers estimate the basis vectors of ICA
$$\hat{x}(t) = A y(t). \qquad (14.22)$$


When the three-layer linear autoassociative network is used for PCA, we have
the relations $\hat{x} = AW^T x$ and $A = W(W^T W)^{-1}$ [9]. For PCA, $W^T W = I_{J_2}$,
thus $\hat{x} = WW^T x$. The ICA solution can be obtained by imposing the additional
constraint that the components of the bottleneck layer output vector $y = W^T x$
must be mutually independent or as independent as possible. For ICA,
$\hat{x} = AW^T x = A(A^T A)^{-1}A^T x$ is the LS approximation [59]. Notice that $A$ and
$W^T$ are the pseudoinverses of each other. 


Prewhitening 
By performing prewhitening, we have 


$$V^T = \Lambda^{-1/2} E^T, \qquad (14.23)$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{J_2})$, $E = [c_1, \ldots, c_{J_2}]$, with $\lambda_i$ and $c_i$ as the $i$th largest
eigenvalue and the corresponding eigenvector of the covariance matrix $C$. PCA
can be applied to solve for the eigenvalues and eigenvectors. 
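
A closed-form computation of the whitening matrix in (14.23) may look as follows. This is a sketch assuming a batch of data and the sample covariance; the function name, the example mixing matrix and the use of `numpy.linalg.eigh` are choices made for illustration.

```python
import numpy as np

def pca_whitening(X, J2):
    """Return V (J1 x J2) such that V^T = Lambda^{-1/2} E^T, cf. (14.23)."""
    C = np.cov(X)                          # J1 x J1 sample covariance (mean removed)
    lam, E = np.linalg.eigh(C)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:J2]     # indices of the J2 largest eigenvalues
    return E[:, order] / np.sqrt(lam[order])   # V = E Lambda^{-1/2}

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 3000))
A = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, -0.4]])   # hypothetical 3 x 2 mixing
X = A @ S
V = pca_whitening(X, J2=2)
v = V.T @ (X - X.mean(axis=1, keepdims=True))         # whitened mixtures
```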

A simple local algorithm for learning the whitening matrix is given by [21] 


$$V(t+1) = V(t) - \eta(t)\, V(t) \left[ v(t)v^T(t) - I \right]. \qquad (14.24)$$

It is used as part of the EASI separation algorithm [21]. This algorithm does not
have any optimality properties in data compression, and it sometimes suffers from
stability problems. The validity of the algorithm can be justified by observing
the whiteness condition $E[v(t)v^T(t)] = I_{J_2}$ after convergence. 
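
The rule (14.24) can be sketched as below. The first function is the per-sample update; the second iterates its averaged form, in which v v^T is replaced by V^T C V with C the sample covariance, so that the fixed point V^T C V = I can be checked deterministically. The averaged form, the square J1 x J1 case, the initialization and the step size are assumptions of this example.

```python
import numpy as np

def whitening_step(V, x, eta):
    """Per-sample rule (14.24): V <- V - eta V (v v^T - I), with v = V^T x."""
    v = V.T @ x
    return V - eta * V @ (np.outer(v, v) - np.eye(V.shape[1]))

def adaptive_whitening(X, eta=0.1, n_iter=500):
    """Iterate the averaged rule V <- V - eta V (V^T C V - I) on a data block."""
    J1 = X.shape[0]
    C = np.cov(X)
    V = np.eye(J1) / np.sqrt(np.trace(C))   # small start, so that V^T C V <= I
    for _ in range(n_iter):
        V = V - eta * V @ (V.T @ C @ V - np.eye(J1))
    return V

rng = np.random.default_rng(0)
X = np.array([[2.0, 1.0], [1.0, 1.0]]) @ rng.standard_normal((2, 2000))
V = adaptive_whitening(X)
```

With a larger step size or a poorly scaled start, the iteration can overshoot, which is consistent with the stability problems noted above.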

Prewhitening usually makes separation algorithms converge faster and often
gives them better stability properties. However, if the mixing matrix $A$ is
ill-conditioned, whitening can make separation of sources more difficult or even
impossible [21, 59]. 

$\beta$-prewhitening [79] minimizes the empirical $\beta$-divergence over the space of
all Gaussian distributions. The $\beta$-divergence reduces to the Kullback-Leibler
divergence when $\beta \to 0$. $\beta$-prewhitening with $\beta = 0$ is equivalent to standard
prewhitening if the data set is not seriously corrupted by noise or outliers. For
data sets seriously corrupted by noise or outliers, $\beta$-prewhitening with $\beta > 0$ is
much better than standard prewhitening. 


Separating algorithms 

The separating algorithms can be based on robust PCA, nonlinear PCA, or 
bigradient nonlinear PCA [59]. W can also be calculated iteratively without 
prewhitening as in EASI [21] or generalized EASI [59]. The nonlinear PCA
algorithm is given as [59] 

$$U(t+1) = U(t) + \eta \left[ v(t) - U(t)\, g(y(t)) \right] g(y^T(t)), \qquad (14.25)$$

where $g(\cdot)$ is usually selected to be odd for stability and separation reasons, such
as $g(t) = t^3$, and $\eta > 0$ slowly reduces to zero or is a small constant. 
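
A single step of the nonlinear PCA rule (14.25) can be written compactly, as in the sketch below. The function name and default step size are hypothetical, and the cubic nonlinearity is taken from the example given in the text.

```python
import numpy as np

def nonlinear_pca_step(U, v, eta=0.01, g=lambda t: t**3):
    """U <- U + eta [v - U g(y)] g(y)^T with y = U^T v, cf. (14.25)."""
    gy = g(U.T @ v)                              # nonlinearity applied to the outputs
    return U + eta * np.outer(v - U @ gy, gy)    # outer product of error and g(y)

U = np.eye(2)
v = np.array([0.8, -0.3])                        # one whitened sample (made up)
U = nonlinear_pca_step(U, v)
```

With the identity in place of g and a square orthogonal U, the reconstruction error v - U U^T v vanishes and the rule leaves U unchanged, so the nonlinearity is what drives the separation.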


Estimation of basis vectors 
The estimation of the basis vectors can be based on the LS solution 

$$\hat{x}(t) = A y(t) = W (W^T W)^{-1} y(t), \qquad (14.26)$$
where $A = [\hat{a}_1, \ldots, \hat{a}_{J_2}] = W(W^T W)^{-1}$ is a $J_1 \times J_2$ matrix. If prewhitening
(14.23) is applied, $A$ can be simplified as 

$$A = E \Lambda^{1/2} U. \qquad (14.27)$$

Thus, the unnormalized $i$th basis vector of ICA is $\hat{a}_i = E\Lambda^{1/2}u_i$, where $u_i$ is
the $i$th column of $U$, and its squared norm becomes $\|\hat{a}_i\|^2 = u_i^T \Lambda u_i$. Local
algorithms for estimating the basis vectors can be derived by minimizing the
MSE $E[\|x - Ay\|^2]$ using the gradient-descent method [59] 

$$A^T(t+1) = A^T(t) + \eta\, y(t) \left[ x_t^T - y^T(t) A^T(t) \right]. \qquad (14.28)$$


For any of the last three layers of the ICA network, it is possible to use either a 
local or a nonlocal learning method. 
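
The simplification from (14.26) to (14.27) can be checked numerically. The sketch below uses randomly generated stand-ins for the eigenvector matrix E, the eigenvalues and the orthogonal matrix U; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
J1, J2 = 4, 3
E, _ = np.linalg.qr(rng.standard_normal((J1, J2)))   # orthonormal columns
Lam = np.array([3.0, 2.0, 0.5])                      # stand-in eigenvalues
U, _ = np.linalg.qr(rng.standard_normal((J2, J2)))   # orthogonal separating matrix

V = E / np.sqrt(Lam)                     # V = E Lambda^{-1/2}, from (14.23)
W = V @ U                                # W^T = U^T V^T, from (14.13)
A_ls = W @ np.linalg.inv(W.T @ W)        # LS solution (14.26)
A_closed = (E * np.sqrt(Lam)) @ U        # closed form (14.27)
sq_norms = np.sum(A_closed**2, axis=0)   # squared norms of the basis vectors
```

Both expressions give the same matrix because W^T W = U^T Lambda^{-1} U when E has orthonormal columns and U is orthogonal.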

The quality of separation can be measured in terms of the performance index 
defined as [3]: 


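$$J = \sum_{i=1}^{J_2}\left(\sum_{j=1}^{J_2}\frac{|c_{ij}|}{\max_k |c_{ik}|} - 1\right) + \sum_{j=1}^{J_2}\left(\sum_{i=1}^{J_2}\frac{|c_{ij}|}{\max_k |c_{kj}|} - 1\right), \qquad (14.29)$$

where $C = [c_{ij}] = W^T A$. The performance index is always nonnegative, and zero
value means perfect separation. 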
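
The performance index is straightforward to compute from the global matrix C. The sketch below implements the row-wise and column-wise normalized sums in the form commonly attributed to [3]; the function name is hypothetical.

```python
import numpy as np

def performance_index(C):
    """Sum over rows and columns of (sum of |c| / max |c|) - 1, cf. (14.29)."""
    C = np.abs(np.asarray(C, dtype=float))
    rows = (C / C.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (C / C.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return rows.sum() + cols.sum()

# A scaled, sign-flipped permutation counts as perfect separation.
P = np.array([[0.0, -2.0, 0.0],
              [0.5,  0.0, 0.0],
              [0.0,  0.0, 3.0]])
print(performance_index(P))
```

The index is invariant to the permutation and scaling ambiguities of ICA, since each row and column is normalized by its largest absolute entry.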


Some ICA methods 


Nonlinear ICA 


ICA algorithms discussed so far are linear ICA methods for separating original 
sources from linear mixtures. Blind separation of the original signals in nonlin- 
ear mixtures has many difficulties such as intrinsic indeterminacy, the unknown 
distribution of the sources as well as the mixing conditions, and the presence of 
noise. Using only the source independence assumption, it is impossible to
separate the original sources from unknown nonlinear transformations of the
sources [48]: in the nonlinear case, ICA has an infinite number of solutions that
are not related in any simple way to one another [48]. Nonlinear BSS is therefore
an ill-posed problem, and further knowledge must be applied to the problem through
a suitable form of regularization. In spite of these difficulties, some methods are effective 
for nonlinear BSS. Nonlinear ICA can be modeled by a parameterized neural net- 
work whose parameters can be determined under the criterion of independence 
of its outputs. 

The inverse of the nonlinear mixing model can be modeled by using the three- 
layer MLP [17, 2] or the RBF network [100]. SOM provides a parameter-free 
method to the nonlinear ICA problem [43], but suffers from the exponential 
growth of the network complexity with the dimensions of the output lattice. 

The minimal nonlinear distortion principle [112] tackles the ill-posedness of 
nonlinear ICA problems. It prefers the nonlinear ICA solution that is as close 
as possible to the linear solution, among all possible solutions. It also helps to 
avoid local optima in the solutions. To achieve minimal nonlinear distortion, 
a regularization term is exploited to minimize the MSE between the nonlinear 
mixing mapping and the best-fitting linear one. 


Constrained ICA 


ICA is an ill-posed problem because of the indeterminacy of scaling and per- 
mutation of the solution. The recovered independent components can have an 
arbitrary permutation of the original sources, and the estimated independent 
components may also be dilated from the originals [32]. Incorporation of prior 
knowledge and further requirements converts the ill-posed ICA problem into 
a well-posed problem. In some BSS applications, knowledge about the sources 
or the mixing channels may be available: for example, statistical properties of 
speech or physical distances between the locations of the microphones and the 
speakers in the cocktail-party problem. 

Constrained ICA is a framework that incorporates additional requirements 
and prior information in the form of constraints into the ICA contrast function 
[76]. The approach given in [75] sorts independent components according to 
some statistic and normalizes the demixing matrix or the energies of separated 
independent components. With some prior knowledge, the algorithm is able to 
identify and extract the original sources perfectly from their mixtures. Adaptive 
solutions using Newton-like learning are given in [76]. 

ICA with reference [77] extracts an interesting subset of independent sources 
from their linear mixtures when some a priori information of the sources is 
available in the form of rough templates (references), in a single process. A 
neural algorithm is proposed using a Newton-like approach to obtain an optimal 
solution to the constrained optimization problem. ICA with reference converges 
at least quadratically. 

Constrained ICA can be formulated in mutual information terms directly [103]. 
As an estimate of mutual information, a robust version of the Edgeworth expan- 
sion is used, on which gradient descent is performed. Another way of constraining 
ICA has been introduced in [1]. ICA is applied to two data sets separately. The 
corresponding dependent components between the two sets are then determined 
using a natural gradient-type learning rule, thus performing an ICA-style
generalization of CCA. 

Prior information on the sparsity of the mixing matrix can be a constraint [54]. 
In a biological interpretation, sparseness of the mixing matrix means sparse
connectivity of the neural network. The criterion functions for ICA can be modified if 
prior information about the entries of the mixing matrix is available [56]. 


Nonnegativity ICA 


The constraint of nonnegative sources, perhaps with an additional constraint of 
nonnegativity on the mixing matrix A, is often known as NMF. We refer to the 
combination of nonnegativity and independence assumptions on the sources as 
nonnegative ICA. 

Nonnegative PCA and nonnegative ICA algorithms are given in [91], where the 
sources $s_i$ are assumed to be nonnegative. The nonnegative ICA algorithms [91] 
are based on a two-stage process common to many ICA algorithms, prewhitening 
and rotation. However, instead of using the usual non-Gaussianity measures such 
as kurtosis in the rotation stage, the nonnegativity constraint is used. 

Some algorithms for nonnegative ICA [85], [91] are based on the assumption
that the sources $s_i$ are well grounded, in addition to being independent and
nonnegative. We call a source $s_i$ well grounded if $\Pr(s_i < \epsilon) > 0$ for any
$\epsilon > 0$, i.e., $s_i$ has nonzero pdf all the way down to zero. However, many
real-world nonnegative sources, e.g., images, are not well grounded. 

In [85], a gradient algorithm is derived from a cost function whose minimum 
coincides with nonnegativity under the whitening constraint, under which the 
separating matrix is orthogonal. In the Stiefel manifold of orthogonal matrices, 
the cost function is a Lyapunov function for the matrix gradient flow, implying 
global convergence [85]. A nonnegative PCA algorithm has good performance of 
separation [85], and a discrete-time version of the algorithm developed is shown to 
be globally convergent under certain conditions [109]. Nonnegative ICA proposed 
in [113] can work efficiently even when the source signals are not well grounded. 
This method is insensitive to the particular underlying distribution of the source 
data. 

The stochastic nonnegative ICA method of [7] minimizes mutual information 
between recovered components by using a nonnegativity-constrained simulated 
annealing algorithm. Convex analysis of mixtures of nonnegative sources [26] has 
been theoretically proven to achieve perfect separation by searching for all the 
extreme points of an observation-constructed polyhedral set. In [105], a joint 
correlation function of multiple signals confirms that the observations after non- 
negative mixing would have higher joint correlation than the original unknown 
sources. Accordingly, a nonnegative least-correlated component analysis method 
designs the unmixing matrix by minimizing the joint correlation function among 
the estimated nonnegative sources. The general algorithm is developed based on 
an iterative volume maximization principle and LP. The source identifiability 
and required conditions are discussed and proven [105]. 


ICA for convolutive mixtures 


ICA for convolutive mixtures is computationally expensive for long FIR fil- 
ters because it includes convolution operations. Time-domain BSS applies ICA 
directly to the convolutive mixture model [4], [63]. The approach achieves good 
separation once the algorithm converges, since ICA correctly evaluates the inde- 
pendence of separated signals. 

Frequency-domain BSS applies complex-valued ICA for instantaneous mix- 
tures in each frequency bin [99], [80], [97]. ICA can be performed separately at 
each frequency. Also, any complex-valued instantaneous ICA algorithm can be 
employed. However, frequency-domain BSS involves a permutation problem: the 
permutation ambiguity of ICA in each frequency bin should be aligned so that 
a separated signal in the time domain contains frequency components of the 
same source signal. A robust and precise method is presented in [95] for solving 
the permutation problem, based on two approaches: direction of arrival (DoA) 
estimation for sources and the interfrequency correlation of signal envelopes. 
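
The interfrequency correlation idea can be illustrated with a simplified sketch. It covers only the envelope-correlation part, not the DoA part, and it is not the method of [95]: the greedy bin-by-bin alignment against a running reference, the exhaustive search over permutations, and the function name are assumptions made for this example.

```python
import numpy as np
from itertools import permutations

def align_permutations(env):
    """Align the source order across frequency bins by envelope correlation.

    env : (F, n, T) array of envelopes |Y_i(f, t)| of the separated signals.
    Returns an (F, n) integer array perms such that env[f, perms[f]] has a
    consistent source order for all bins f.
    """
    F, n, T = env.shape
    perms = np.zeros((F, n), dtype=int)
    perms[0] = np.arange(n)
    ref = env[0].copy()                       # running sum of aligned envelopes
    for f in range(1, F):
        best, best_score = None, -np.inf
        for p in permutations(range(n)):      # exhaustive search, fine for small n
            cand = env[f, list(p)]
            score = sum(np.corrcoef(ref[i], cand[i])[0, 1] for i in range(n))
            if score > best_score:
                best, best_score = p, score
        perms[f] = best
        ref += env[f, perms[f]]
    return perms

# Synthetic check: two sources active in different halves of the recording.
rng = np.random.default_rng(0)
T, F = 200, 6
base = np.vstack([np.arange(T) < T // 2, np.arange(T) >= T // 2]).astype(float)
swaps = np.array([0, 1, 0, 1, 1, 0])          # 1 marks a bin with swapped outputs
env = np.empty((F, 2, T))
for f in range(F):
    e = (1.0 + f) * base + 0.05 * rng.random((2, T))
    env[f] = e[::-1] if swaps[f] else e
perms = align_permutations(env)
```

Envelopes of the same source tend to rise and fall together across frequencies, which is why their correlation identifies the matching outputs; the exhaustive search grows factorially with the number of sources and is only practical for a few of them.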

Independent vector analysis solves frequency-domain BSS effectively without 
suffering from the permutation problem between the frequencies by utilizing 
dependencies of frequency bins. It performs successfully under most conditions 
including the ill-posed condition such as the case where the mixing filters of the 
sources are very similar [64]. 

In an ML spatio-temporal BSS algorithm [44], the temporal dependencies are 
analyzed by assuming that each source is an autoregressive process and the dis- 
tribution is described using a mixture of Gaussians. Optimization is performed 
by using the EM method to maximize the likelihood, and the update equations 
have a simple, analytical form. The method has excellent performance for arti- 
ficial mixtures of real audio. 


Other methods 


The relative Newton method [114] is an exemplary relative optimization method, 
where the modified differential matrix is learned through Newton-type updates. 
A direct application of the trust-region method to ICA can be found in [27]. 
Relative trust-region learning [28] jointly exploits the trust-region method and 
relative optimization. The method finds a direction and a step size with the 
help of a quadratic model of the objective function and updates parameters in a 
multiplicative fashion. The resulting relative trust-region ICA algorithm achieves 
a faster convergence than the relative gradient and even Newton-type algorithms 
do. FastICA is slightly faster than relative trust-region ICA. However, in
the case of small data sets, FastICA does not work well, whereas relative
trust-region ICA works fine. Moreover, for an ill-conditioned high-dimensional data
set, the relative trust-region method converges much faster than FastICA does. 

When a linear mixture of independent sources is contaminated by multiplica- 
tive noise, the problems of BSS and feature extraction are highly complex. This 
is called the multiplicative ICA model. This noise commonly exists in coherent 
images such as ultrasound, synthetic aperture radar or laser images. The approach 
followed by ICA does not produce proper results, because the output of a linear 
transformation of the noisy data cannot be independent. However, the statistic 
of this output possesses a special structure that can be used to obtain the origi- 
nal mixture. In [12], this statistical structure is studied and a general approach 
to solving the problem is stated. The FMICA method [13] obtains the unmixing 
matrix from a linear mixture of independent sources in the presence of multi- 
plicative noise, without any limitation in the nature of the sources or the noise. 
The statistical structure of a linear transformation of the noisy data is studied 
up to the fourth order, and then this structure is used to find the inverse of the 
mixing matrix through the minimization of a cost function. NPICA [16] uses a 
nonparametric kernel density estimation technique; it performs simultaneously 
the estimation of the unknown pdfs of the source signals and the estimation of 
the unmixing matrix. 

A recurrent network is described for performing robust BSS in [31]. Existing 
adaptive ICA algorithms with equivariant properties are extended to simultane- 
ously perform unbiased estimation of the separating matrix and noise reduction 
on the extracted sources. The optimal choice of nonlinear activation functions 
is discussed for various noise distributions assuming a generalized Gaussian- 
distributed noise model. 

RADICAL (Robust, Accurate, Direct ICA aLgorithm) is an ICA algorithm 
based on an efficient entropy estimator [70]. It directly minimizes the measure 
of departure from independence according to the estimated Kullback-Leibler 
divergence between the joint distribution and the product of the marginal dis- 
tributions. In particular, the entropy estimator used is consistent and exhibits 
rapid convergence. The estimator’s relative insensitivity to outliers translates 
into superior performance by RADICAL on outlier tests. RADICAL presents 
favorable comparisons to kernel ICA, FastICA, JADE and extended infomax. 

RobustICA [110] is a simple deflationary ICA method. It consists of performing 
exact line search optimization of the kurtosis contrast function. RobustICA can 
avoid prewhitening and deals with real- and complex-valued mixtures of possibly 
noncircular sources alike. The algorithm targets sub-Gaussian or super-Gaussian 
sources in the order specified by the user. RobustICA proves faster and more 
efficient than FastICA with asymptotic cubic global convergence. In the real- 
valued two-signal case, the algorithm converges in a single iteration. 

The minimax mutual information ICA algorithm [36] is an efficient and robust 
ICA algorithm motivated by the maximum entropy principle. The optimality 
criterion is the minimum output mutual information, where the estimated pdfs 
are from the exponential family and are approximate solutions to a constrained 
entropy maximization problem. This approach yields an upper bound for the 
actual mutual information of the output signals. One approach that is commonly 
taken in designing information-theoretic ICA algorithms is to use some form of 
polynomial expansion to approximate the pdf of the signals [32], [3], [46]. A 
sequential, fixed-point subspace ICA algorithm given in [102] is based on the 
gradient of a robust version of the Edgeworth expansion of mutual information 
and a constrained ICA version is also based on goal programming of mutual 
information objectives. It performs comparably to robust FastICA and infomax, 
and much better than JADE. 

In [24], the connections between mutual information, entropy and non- 
Gaussianity in a larger framework are explored without resorting to a somewhat 
arbitrary decorrelation constraint. A key result is that the mutual information 
can be decomposed, under linear transforms, as the sum of two terms: one term 
expressing the decorrelation of the components and the other expressing their 
non-Gaussianity. 

Second-order methods for BSS cannot be easily implemented by neural models. 
Results on the nonlinear PCA criterion in BSS [61] clarify how the nonlinearity 
should be chosen optimally. The connections of the nonlinear PCA learning rule 
with the infomax algorithm and the adaptive EASI algorithm are also discussed 
in [61]. A nonlinear PCA criterion can be minimized using LS approaches, leading 
to computationally efficient and fast converging algorithms. 

ICA of sparse signals (sparse ICA) may be done by a combination of a clus- 
tering algorithm and PCA [8]. The final algorithm is easy to implement for any 
number of sources. This, however, requires the number of samples to grow
exponentially as the number of sources increases. 

Localized ICA is used to characterize nonlinear ICA [62]. Clustering is first 
used for an overall coarse nonlinear representation of the underlying data and 
linear ICA is then applied in each cluster so as to describe local features of the 
data. That is, before linear ICA is applied, a clustering algorithm groups the
observed data into several clusters according to their similarity. This leads to
a better representation of the data than linear ICA alone, in a computationally
feasible manner.

In practice, the estimated independent components are often not at all inde- 
pendent. This residual dependence structure could be used to define a topo- 
graphic order for the components [53]. A distance between two components could 
be defined using their higher-order correlations, and this distance could be used 
to create a topographic representation. Thus, we obtain a linear decomposition 
into approximately independent components, where the dependence of two com- 
ponents is approximated by the proximity of the components in the topographic 
representation. Topographic ICA can be considered a generalization of another 
modification of the ICA model: independent subspace analysis [51]. 

A linear combination of the separator output fourth-order marginal cumu- 
lants (kurtoses) is a valid contrast function for eliminating the permutation 
ambiguity of ICA with prewhitening. The analysis confirms that the method 
presented in [32], despite arising from the mutual information principle, presents
ML-optimality features [111]. 

The BSS problem can be solved in a single step by the CCA approach [96]. 
With CCA, the objective is to find a transformation matrix, which when applied 
to the mixtures, maximizes the autocorrelation of each of the recovered signals. 

Separation of independent sources using ICA requires prior knowledge of the
number of independent sources. In the overcomplete situation, performing ICA
can result in incorrect separation and poor quality. The undercomplete situation
is often encountered in applications such as sensor networks for environmental
or defense monitoring, where the number of sensors may exceed the number of
components, or when the components are not independent. The normalized
determinant of the global matrix $G = WA$ is a measure of the number of
independent sources, $N$, in a mixture of $M$ recordings [81].

In [89], a contrast for BSS of natural signals is proposed, which measures the 
algorithmic complexity of the sources and also the complexity of the mixing 
mapping. The approach can be seen as an application of the MDL principle. 
The complexity is then taken as the length of the compressed signal in bits. 
No assumption about underlying pdfs of the sources is necessary. Instead, it 
is required that the independent source signals have low complexity, which is 
generally true for natural signals. Minimum mutual information coincides with 
minimizing complexity in a special case. The complexity minimization method 
gives clearly more accurate results for separating correlated signals than the 
reference method utilizing ICA does. It can be applied to nonlinear BSS and 
nonlinear exploratory projection pursuit. 


Complex-valued ICA 


ICA for separating complex-valued sources is needed for convolutive source
separation in the frequency domain, or for performing source separation on
complex-valued data, such as fMRI or radar data. Split complex infomax [99] uses
a nonanalytic nonlinearity, since the real and imaginary values are split into
separate channels. Fully-complex infomax [18] simply uses an analytic (and hence
unbounded) complex nonlinearity for infomax for processing complex-valued
sources. When compared to split complex approaches, the shape of the
performance surface is improved, resulting in better convergence characteristics.

In the complex ICA model, all sources $s_i$ are zero-mean and have unit variance
with uncorrelated real and imaginary parts of equal variance. That is,
$E[ss^H] = I$ and $E[ss^T] = O$. For algorithms such as JADE, the extension to
the complex case is straightforward due to the algorithm's use of fourth-order
cumulants. An Edgeworth expansion is used in [32] to approximate negentropy
based on third- and fourth-order cumulants, and hence again, it can be relatively
easily applied to the complex case. However, using higher-order cumulants
typically results in an estimate sensitive to outliers.

The theory of complex-weighted network learning has led to effective ICA 
algorithms for complex-valued statistically independent signals [39], [40]. The 
ψ-APEX algorithms and GHA are, respectively, extended to the complex-valued
case [39, 40]. Based on a suitably selected nonlinear function, these algorithms 
can be used for BSS of complex-valued circular source signals. 

FastICA has been extended to complex-valued sources, leading to c-FastICA 
[11]. c-FastICA is shown to keep the cubic global convergence property of its 
real counterpart [92]. c-FastICA, however, is only valid for second-order circular 
sources. Stability analysis shows that practically any nonquadratic even function 
can be used to construct a cost function for ICA through non-Gaussianity max- 
imization [47]. This observation is extended to complex sources in c-FastICA by 
using the cost function [11] 


$$J(w) = E\left[ g\left( |w^H z|^2 \right) \right] \tag{14.30}$$


where $g(\cdot)$ is a smooth even function, e.g., $g(y) = y^2$.

Recent efforts extend the usefulness of the algorithm to noncircular sources 
[34], [83], [82], [72]. Complex ICA can be performed by maximization of the 
complex kurtosis cost function using gradient update, fixed-point update, or 
Newton update [72]. FastICA is also derived in [34] for the blind separation 
of complex-valued mixtures of independent, noncircularly symmetric, and non- 
Gaussian source signals on a kurtosis-based contrast. In [83], the whitened obser- 
vation pseudocovariance matrix is incorporated into the FastICA update rule to 
guarantee local stability at the separating solutions even in the presence of non- 
circular sources. For kurtosis-based nonlinearity, the resulting algorithm bears 
close resemblance to that derived in [34] through an approach sparing differentia- 
tion. Similar algorithms are proposed in [82] through a negentropy-based family 
of cost functions preserving phase information and thus adapted to noncircu- 
lar sources. Both a gradient-descent and a quasi-Newton algorithm are derived 
by using the full second-order statistics, providing superior performance with 
circular and noncircular sources. 

The kurtosis or fourth-order cumulant of a zero-mean complex random variable 
is defined as a real number [23] 


$$\mathrm{kurt}(y) = \mathrm{cum}(y, y^*, y, y^*) = E\left[|y|^4\right] - 2\left(E\left[|y|^2\right]\right)^2 - \left|E\left[y^2\right]\right|^2, \tag{14.31}$$


and can be shown to be zero for any complex Gaussian variable, circular or
noncircular. This result implies that sources with zero kurtosis, including
noncircular Gaussian sources, cannot be separated well under this criterion.
Noncircular Gaussian sources can instead be separated using ML [25] or the
strong-uncorrelating transform algorithm [37].
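
The behavior of the complex kurtosis can be checked numerically. The following is a minimal sketch, not taken from [23]: the function name `complex_kurtosis`, the sample sizes, and the variance splits 0.5/0.5 and 0.8/0.2 are our own illustrative choices. It evaluates the sample version of the kurtosis definition for circular and noncircular Gaussian samples and for two non-Gaussian constellations.

```python
import random

def complex_kurtosis(samples):
    """Sample estimate of E|y|^4 - 2 (E|y|^2)^2 - |E[y^2]|^2 for zero-mean y."""
    n = len(samples)
    m4 = sum(abs(y) ** 4 for y in samples) / n   # E[|y|^4]
    m2 = sum(abs(y) ** 2 for y in samples) / n   # E[|y|^2]
    p2 = sum(y * y for y in samples) / n         # E[y^2], the pseudo-variance
    return m4 - 2.0 * m2 ** 2 - abs(p2) ** 2

rng = random.Random(0)   # fixed seed, used only for reproducibility
n = 200_000

# Circular Gaussian: real and imaginary parts share the variance 1/2.
circular = [complex(rng.gauss(0, 0.5 ** 0.5), rng.gauss(0, 0.5 ** 0.5))
            for _ in range(n)]
# Noncircular Gaussian: unequal variances 0.8 and 0.2 (illustrative values).
noncircular = [complex(rng.gauss(0, 0.8 ** 0.5), rng.gauss(0, 0.2 ** 0.5))
               for _ in range(n)]
# Non-Gaussian constellations with equiprobable unit-modulus symbols.
qpsk = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)] * 1000
bpsk = [complex(1, 0), complex(-1, 0)] * 1000

k_circular = complex_kurtosis(circular)
k_noncircular = complex_kurtosis(noncircular)
k_qpsk = complex_kurtosis(qpsk)
k_bpsk = complex_kurtosis(bpsk)

print(k_circular, k_noncircular, k_qpsk, k_bpsk)
```

Both Gaussian estimates are close to zero up to sampling error, whereas the QPSK and BPSK constellations give kurtosis -1 and -2, respectively, so only the latter two would be visible to a kurtosis-based contrast.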

A linear transformation called the strong-uncorrelating transform [37] uses
second-order statistics through the covariance and pseudocovariance matrices
and performs ICA by joint diagonalization of these matrices. Although
efficient, the algorithm restricts the sources to be noncircular with distinct
spectra of the pseudocovariance matrix. It can be viewed as an extension of
the conventional whitening transform for complex random vectors. The
strong-uncorrelating transform is just the ordinary whitening transform for a
second-order circular complex random vector ($E[ss^T] = O$). The method can be
used as a fast ICA method for complex signals. It is able to separate almost all
mixtures if the sources belong to a class of complex noncircular random
variables. The strong-uncorrelating transform is used as a prewhitening step in
some ICA algorithms, e.g., in [34].
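
A sketch of how such a transform can be computed is given below. It is one possible construction, assuming NumPy, and is not claimed to be the implementation of [37]: the data are first whitened with the sample covariance, and the sample pseudocovariance of the whitened data is then diagonalized by a Takagi factorization obtained from an SVD. The function name, the test sources and the mixing matrix are illustrative assumptions, and the construction presumes distinct, nonzero singular values of the whitened pseudocovariance.

```python
import numpy as np

def strong_uncorrelating_transform(X):
    """Return W such that Y = W @ X has identity sample covariance and a
    diagonal, real, nonnegative sample pseudocovariance.
    X has shape (channels, samples) and is assumed to be zero-mean."""
    n = X.shape[1]
    C = X @ X.conj().T / n                          # covariance E[x x^H]
    evals, E = np.linalg.eigh(C)
    whitener = (E / np.sqrt(evals)) @ E.conj().T    # C^{-1/2}
    Z = whitener @ X
    P = Z @ Z.T / n                                 # pseudocovariance E[z z^T]
    # Takagi factorization P = Q diag(s) Q^T from the SVD P = U diag(s) V^H.
    U, s, Vh = np.linalg.svd(P)
    d = np.diag(U.conj().T @ Vh.T)                  # phases linking conj(V) and U
    Q = U * np.sqrt(d / np.abs(d))                  # scale column j by sqrt(d_j)
    return Q.conj().T @ whitener

rng = np.random.default_rng(0)
n = 5000
# Three noncircular sources with different real/imaginary variance splits.
var_re = np.array([0.95, 0.75, 0.55])[:, None]
S = (np.sqrt(var_re) * rng.standard_normal((3, n))
     + 1j * np.sqrt(1 - var_re) * rng.standard_normal((3, n)))
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
X = A @ S
X = X - X.mean(axis=1, keepdims=True)

W = strong_uncorrelating_transform(X)
Y = W @ X
cov_Y = Y @ Y.conj().T / n      # should be the identity
pcov_Y = Y @ Y.T / n            # should be diagonal, real and nonnegative
```

For circular data the pseudocovariance vanishes, the second step carries no information, and the procedure reduces to ordinary whitening, in agreement with the remark above.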

Extending the theorems proved for the real-valued instantaneous ICA model
[32], the theorems given in [38] state the conditions for identifiability, separability,
and uniqueness of complex-valued linear ICA models. Both circular (proper) and 
noncircular complex random vectors are covered by the theorems. The conditions 
for identifiability and uniqueness are sufficient and the separability condition is 
necessary. 

In [74], a natural gradient complex ML ICA update rule and its variant with
a unitary constraint on the demixing matrix, as well as a Newton algorithm, are
derived. The conditions for local stability are derived using a generalized
Gaussian density source model.

Complex ICA by entropy-bound minimization [73] uses an entropy estimator 
for complex random variables by approximating the entropy estimate using a 
numerically computed maximum entropy bound and a line search optimization 
procedure. It has superior separation performance and computational efficiency 
in separation of complex sources that come from a wide range of bivariate dis- 
tributions. 

The generalized uncorrelating transform [87] is a generalization of the
strong-uncorrelating transform [38] based on generalized estimators of the scatter
matrix and spatial pseudo-scatter matrix. It is a separating matrix estimator
for complex-valued ICA when at most one source random variable possesses a
circularly symmetric distribution and the sources do not have identical
distributions.


Stationary subspace analysis and slow feature analysis 


Stationary subspace analysis 

In many settings, the observed signals are a mixture of underlying stationary and
nonstationary sources. Stationary subspace analysis decomposes a multivariate
time series into its stationary and nonstationary parts [104]. The observed time
series $x(t)$ is generated as a linear mixture of stationary source $s^s(t)$ and
nonstationary source $s^n(t)$ with a time-constant mixing matrix $A$,


$$x(t) = A s(t) = \begin{bmatrix} A^s & A^n \end{bmatrix} \begin{bmatrix} s^s(t) \\ s^n(t) \end{bmatrix}, \tag{14.32}$$


and the objective is to recover these two groups of underlying sources given only
samples from $x(t)$. Stationary subspace analysis can be used for change-point
detection in high-dimensional time series [14]. The dimensionality of the data
can be reduced to the most nonstationary directions, which are most informative
for detecting state changes in the time series.

Analytic stationary subspace analysis [42] solves a generalized eigenvalue
problem. The solution is guaranteed to be optimal under the assumption that the
covariance between stationary and nonstationary sources is time-constant.
Analytic stationary subspace analysis finds a sequence of projections, ordered by
their degree of stationarity. It is more than 100 times faster than the
Kullback-Leibler divergence-based method [104].


Slow feature analysis 

Slow feature analysis [106] aims to extract temporally coherent features from
high-dimensional and/or delayed sensor measurements. Let
$\{x_t\}_{t=1}^{n} \subset \mathcal{X}$ be a sequence of $n$ observations. It finds a set of
mappings $\phi_i : \mathcal{X} \to \mathbb{R}$, $i = 1, \ldots, p$, such that $\phi_i(x_t)$ changes
slowly over time. The updating complexity is cubic with respect to the input
dimensionality.
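
For linear mappings, a common batch formulation whitens the data and then takes the directions along which the temporal differences have the smallest variance. The sketch below follows this formulation, assuming NumPy; the function name `linear_sfa`, the toy sources and the mixing matrix are illustrative assumptions rather than the procedure of [106]. The two eigen-decompositions are the source of the cubic cost in the input dimensionality.

```python
import numpy as np

def linear_sfa(X, p):
    """Linear slow feature analysis on a time series X of shape (n, d).
    Returns a (d, p) weight matrix and the slowness values of the p features;
    each feature (X - mean) @ w has unit variance, slowest feature first."""
    Xc = X - X.mean(axis=0)
    evals, E = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    S = E / np.sqrt(evals)                 # whitening matrix: cov(Xc @ S) = I
    Z = Xc @ S
    dZ = np.diff(Z, axis=0)                # discrete-time derivative
    devals, dE = np.linalg.eigh(dZ.T @ dZ / len(dZ))   # ascending order
    return S @ dE[:, :p], devals[:p]

# Toy data: a slow and a fast sinusoid, linearly mixed.
t = np.linspace(0, 2 * np.pi, 2000)
sources = np.column_stack([np.sin(t), np.sin(11 * t)])
X = sources @ np.array([[1.0, 0.7], [0.5, 1.2]])

W, slowness = linear_sfa(X, 1)
y = (X - X.mean(axis=0)) @ W[:, 0]         # slowest extracted feature
corr = abs(np.corrcoef(y, sources[:, 0])[0, 1])
print(corr)
```

On this toy mixture the slowest feature recovers the slow sinusoid up to sign and scale.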

Incremental slow feature analysis [69] combines candid covariance-free incre- 
mental PCA and covariance-free incremental MCA. It has simple Hebbian and 
anti-Hebbian updates with a linear complexity in terms of the input dimension- 
ality. Regularized sparse kernel slow feature analysis generates an orthogonal 
basis in the unknown latent space for a given real-world time series by utilizing 
the kernel trick in combination with sparsification [15]. In terms of classifica- 
tion accuracy, the superiority of kernel slow feature analysis over kernel PCA is 
demonstrated in encoding latent variables in [15]. 


EEG, MEG and fMRI 


The human brain exhibits relevant dynamics on all spatial scales, ranging from 
a single neuron to the entire cortex. Extending the Hodgkin-Huxley neuron model
from a patch of cell membrane to whole neurons and to populations of neurons 
in order to predict macroscopic signals such as EEG is a dominant focus in this 
field. Brain-computer interfaces translate brain activities into control signals for 
devices like computers, robots, and so forth. They have a huge potential in 
medical and industrial applications for both disabled and normal people, where 
the learning burden has shifted from a subject to a computer. Experimental 
and theoretical studies of functional connectivity in humans require non-invasive 
techniques such as EEG, MEG, ECG and fMRI. High-density EEG and/or MEG 
data model the event-related dynamics of many cortical areas that contribute 
distinctive information to the recorded signals. 

EEG and MEG provide the most direct measure of cortical activity with high 
temporal resolution (1 ms), but with spatial resolution (1-10 cm) limited by the 
locations of sensors on the scalp. In contrast, fMRI has low temporal resolution
(1-10 s), but high spatial resolution (1-10 mm). To the extent that functional 
activity among brain regions in the cortex may be conceptualized as a large-scale 
brain network with diffuse nodes, fMRI may delineate the anatomy of these 
networks, perhaps most effectively in identifying major network hubs. These 
technologies provide a complete view of dynamical brain activity both spatially 
and temporally. 

EEG data consists of recordings of electrical potentials in many different loca- 
tions on the scalp. These potentials are presumably generated by mixing some 
underlying components of brain activity. Automatic detection of seizures in the 
intracranial EEG recordings is implemented in [107], [108]. ECG recordings con- 
tain contributions from several bioelectric phenomena which include maternal 
and fetal heart activity and various kinds of noise. fMRI determines the spatial 
distribution of brain activities evoked by a given stimulus in a noninvasive manner,
for the study of cognitive function of the brain. fMRI provides only an indirect 
view of neural activity via the blood oxygen level dependent (BOLD) functional 
imaging in primary visual cortex. 

ICA has become an important tool to untangle the components of signals 
in multi-channel EEG data. Subjects wear a cap embedded with a lattice of 
EEG electrodes, which record brain activity at different locations on the scalp. 
Stochastically spiking neurons with refractoriness could in principle learn in 
an unsupervised manner to carry out both information bottleneck optimiza- 
tion and the extraction of independent components [65]. Suitable learning rules 
are derived, which simultaneously keep the firing rate of the neuron within a 
biologically realistic range. 

In [33], extended infomax [71], FastICA, and JADE, available in a MATLAB-based
toolbox, the group ICA of fMRI toolbox (GIFT) (http://icatb.sourceforge.net),
are compared in terms of fMRI analysis, incorporating the implementations from
the ICALAB toolbox (http://www.bsp.brain.riken.jp/ICALAB). fMRI is a
technique that produces complex-valued data.


Example 14.3: EEGLAB is MATLAB-based software for processing continuous
or event-related EEG or other physiological data. We illustrate the principle of
EEG analysis by using a dataset provided with the EEGLAB package. There are
32 channels for EEG measurements, and the sampling rate is 128 Hz. The dataset
and the measurement locations are shown in Fig. 14.5. ICA can be used to
separate out several important types of non-brain artifacts from EEG data, but
only those associated with fixed scalp-map projections, including eye movements
and eye blinks, temporal muscle activity and line noise.
Figure 14.5 An EEG dataset using 32 channels and the channel locations on the human scalp. Two of
the channel locations are in the front of the head. 


14.1 Make a comparison between PCA and ICA. 


14.2 BSS can be performed by ICA or decorrelation. Explain the two 
approaches. 


14.3 In the textile industry, one needs to monitor the uniformity of the thread
radius. A solution is to let the thread pass through a capacitor and measure the
capacitance variation.

(a) Derive the sensitivity of the capacitance change to the thread radius change.
(b) The noise is much higher than the capacitance variation caused by the
nonuniformity of the thread. Consider how to reduce its influence.


14.4 Consider three independent sources:


$$s_1(n) = 0.5 \sin(20n) \cos(10n),$$
$$s_2(n) = 0.01 \exp(\sin(10n) \cos(2n) + \cos(16n),$$


and $s_3(n)$ is a random noise, uniformly drawn from $[-1, 1]$. The mixing matrix is


$$A = \begin{pmatrix} 0.30 & 3.1 & -1.60 \\ -0.65 & 0.49 & 3.54 \\ 0.25 & 0.42 & -0.68 \end{pmatrix}.$$


Select an ICA algorithm to solve for the demixing matrix W. Plot the waveforms 
produced and compare them with the source signals. 


14.5 Two image patches of size 320 x 320 are selected from a set of images of 
natural scenes [51], and down-sampled by a factor of 4 in both directions to yield 
80 x 80 images. The third image is an artificial one containing only noisy signals. 
Each of the images is treated as one source with 6400 pixel samples. The three 
sources are then mixed using a randomly chosen mixing matrix 


$$A = \begin{pmatrix} 0.8762 & 0.2513 & 0.3564 \\ 0.2864 & -0.5615 & 0.3241 \\ -0.3523 & 0.7614 & 0.5234 \end{pmatrix}.$$


Recover the pictures. 


14.6 Four digitized, gray-scale facial images are used as the source signals. The 
images are linearly mixed using a randomly chosen square mixing matrix 


$$\begin{pmatrix} 0.6829 & -0.4246 & 1.8724 & 0.8260 \\ 1.2634 & 1.4520 & -0.5582 & 0.7451 \\ -0.7754 & 0.3251 & 0.5721 & 1.3774 \\ -0.7193 & 1.2051 & 0.2823 & 0.6821 \end{pmatrix}.$$


Separate these images using an ICA algorithm. 
14.7 Consider the application of ICA to wireless sensor networks. 


14.8 Demos of real-room blind separation/deconvolution of two speech
sources are available at http://cnl.salk.edu/~tewon/Blind/blind_audio.html.
Synthetic benchmarks for speech signals are available at
http://sound.media.mit.edu/ica-bench/. Select two speech signals from these
websites. Use the instantaneous linear mixtures of the two speech signals. The
mixing matrix
Kx ie 0.85 


0.72 0.1 s . Reconstruct the original speech signals from the mixed signals. 


14.9 The ECG recordings of a pregnant woman can be downloaded from
the SISTA Identification Database
(http://homes.esat.kuleuven.be/~tokka/daisydata.html). The data contain eight
channels of recordings, where the first five channels record the abdominal
measurements and the last three channels serve as the thoracic measurements.
Separate the signals of independent sources from the measurements.


14.10 Investigate the use of ICA for edge detection of an image. 
Discriminant analysis


15.1 Linear discriminant analysis


Discriminant analysis plays an important role in statistical pattern recognition. 
LDA, originally derived by Fisher, is one of the most popular discriminant analy- 
sis techniques. Under the assumption that the class distributions are identically 
distributed Gaussians, LDA is Bayes optimal [44]. Like PCA, LDA is widely 
applied to image retrieval, face recognition, information retrieval, and pattern 
recognition. 

LDA is a supervised dimension-reduction technique. It projects the data into 
an effective low-dimensional linear subspace while finding directions that max- 
imize the ratio of between-class scatter to within-class scatter of the projected 
data. In the statistics community, LDA is equivalent to a t-test or F-test for 
significant difference between the mean of discriminants for two sampled classes; 
in fact, the statistic is designed to have the largest possible value [42]. LDA uti- 
lizes EVD to find an orientation which projects high-dimensional feature vectors 
of different classes to a low-dimensional space in the most discriminative way 
for classification. C-means can be used to generate cluster labels, which can be 
further used for LDA to do subspace selection [13]. 

LDA creates a linear combination of the given independent features that yields
the largest mean differences between the desired classes [15]. Given a data set
$\{x_i\}$ of size $N$, which is composed of $J_1$-dimensional vectors, for all the samples
of all the $C$ classes, the within-class scatter matrix $S_w$, the between-class scatter
matrix $S_b$, and the mixture or total scatter matrix $S_t$ are, respectively, defined
by


$$S_w = \frac{1}{N} \sum_{j=1}^{C} \sum_{i=1}^{N_j} \left(x_i^{(j)} - \mu_j\right) \left(x_i^{(j)} - \mu_j\right)^T, \tag{15.1}$$

$$S_b = \frac{1}{N} \sum_{j=1}^{C} N_j \left(\mu_j - \mu\right) \left(\mu_j - \mu\right)^T, \tag{15.2}$$

$$S_t = \frac{1}{N} \sum_{j=1}^{N} \left(x_j - \mu\right) \left(x_j - \mu\right)^T, \tag{15.3}$$


where $x_i^{(j)}$ is the $i$th sample of class $j$, $\mu_j$ is the mean of class $j$, $N_j$ is the
number of samples in class $j$, and $\mu$ represents the mean of all the samples in all
classes. Note that $\sum_{j=1}^{C} N_j = N$. All the scatter matrices are of size
$J_1 \times J_1$, and are related by


$$S_t = S_w + S_b. \tag{15.4}$$
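
The three scatter matrices and the relation between them can be verified numerically. The following is a small sketch, assuming NumPy; the function name `scatter_matrices`, the class sizes and the random data are illustrative assumptions. Each matrix is normalized by the total number of samples $N$.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class, between-class and total scatter of the rows of X
    (shape N x J1) with integer class labels y, each normalized by N."""
    N, J1 = X.shape
    mu = X.mean(axis=0)                      # global mean
    Sw = np.zeros((J1, J1))
    Sb = np.zeros((J1, J1))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)               # class mean
        D = Xc - mu_c
        Sw += D.T @ D
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    St = (X - mu).T @ (X - mu)
    return Sw / N, Sb / N, St / N

rng = np.random.default_rng(0)
sizes = [30, 20, 50]                         # N_j for C = 3 classes
X = np.vstack([rng.normal(loc=3.0 * j, size=(nj, 4)) for j, nj in enumerate(sizes)])
y = np.repeat(np.arange(3), sizes)

Sw, Sb, St = scatter_matrices(X, y)
print(np.allclose(St, Sw + Sb))
print(np.linalg.matrix_rank(Sb))             # at most C - 1
```

The decomposition of the total scatter holds exactly for any labeling, and the rank of the between-class scatter cannot exceed $C - 1$ because the $C$ weighted class-mean deviations sum to zero.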


The objective of LDA is to maximize the between-class measure while
minimizing the within-class measure after applying a $J_1 \times J_2$ transform matrix $W$,
$J_2$ being the number of features, $J_1 > J_2$, which transforms the $J_1 \times J_1$ scatter
matrices into $J_2 \times J_2$ matrices $\tilde{S}_w$, $\tilde{S}_b$, and $\tilde{S}_t$,


$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W, \qquad \tilde{S}_t = W^T S_t W. \tag{15.5}$$


$\mathrm{tr}(\tilde{S}_w)$ measures the closeness of the samples within the clusters and $\mathrm{tr}(\tilde{S}_b)$
measures the separation between the clusters, where $\mathrm{tr}(\cdot)$ is the trace operator. An
optimal $W$ should preserve a given cluster structure, and simultaneously
maximize $\mathrm{tr}(\tilde{S}_b)$ and minimize $\mathrm{tr}(\tilde{S}_w)$. This is equivalent to maximizing [43]


$$E_{\mathrm{LDA},1}(W) = \mathrm{tr}\left(\tilde{S}_w^{-1} \tilde{S}_b\right) \tag{15.6}$$


when $S_w$ is a nonsingular matrix.
Assuming that $S_w$ is a nonsingular matrix, one can maximize the Rayleigh
coefficient [44]


$$E_{\mathrm{LDA},2}(w) = \frac{w^T S_b w}{w^T S_w w} \tag{15.7}$$


to find the principal projection direction $w_1$. Conventionally, the following
Fisher's determinant ratio criterion is maximized for finding the projection
directions [11, 28]


$$E_{\mathrm{LDA},3}(W) = \frac{\det(\tilde{S}_b)}{\det(\tilde{S}_w)} = \frac{\det\left(W^T S_b W\right)}{\det\left(W^T S_w W\right)}, \tag{15.8}$$


where the column vectors $w_i$, $i = 1, \ldots, J_2$, of the projection matrix $W$, are the
first $J_2$ principal eigenvectors of $S_w^{-1} S_b$.
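
The projection directions can be obtained numerically from the generalized eigenproblem $S_b w = \lambda S_w w$. The sketch below assumes NumPy and a nonsingular within-class scatter matrix; the function name `lda_directions` and the toy data are illustrative assumptions. Instead of forming the inverse explicitly, it symmetrizes the problem with a Cholesky factor of the within-class scatter, which is one of several equivalent routes.

```python
import numpy as np

def lda_directions(X, y, J2):
    """Top-J2 LDA directions from Sb w = lambda Sw w (Sw assumed nonsingular).
    Returns W (J1 x J2), all eigenvalues in descending order, Sw and Sb."""
    N, J1 = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((J1, J1))
    Sb = np.zeros((J1, J1))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / N
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu) / N
    L = np.linalg.cholesky(Sw)               # Sw = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ Sb @ Linv.T                   # symmetric, same eigenvalues as Sw^{-1} Sb
    evals, V = np.linalg.eigh(M)             # ascending order
    evals, V = evals[::-1], V[:, ::-1]
    W = Linv.T @ V[:, :J2]                   # eigenvectors of Sw^{-1} Sb
    return W, evals, Sw, Sb

rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([rng.normal(loc=m, size=(40, 4)) for m in means])
y = np.repeat(np.arange(3), 40)

W, evals, Sw, Sb = lda_directions(X, y, J2=2)
w1 = W[:, 0]
rayleigh = (w1 @ Sb @ w1) / (w1 @ Sw @ w1)   # Rayleigh coefficient of w1
trace_crit = np.trace(np.linalg.solve(W.T @ Sw @ W, W.T @ Sb @ W))
print(evals, rayleigh, trace_crit)
```

In this example the Rayleigh coefficient of the first direction equals the largest eigenvalue, the trace criterion equals the sum of the two retained eigenvalues, and only $C - 1 = 2$ eigenvalues are nonzero.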

LDA is equivalent to ML classification assuming normal distribution for each 
class with a common covariance matrix [25]. When each class has more complex 
structure, LDA may fail. 

LDA has $O(N J_2 m + m^3)$ cubic computational complexity and requires
$O(N J_2 + N m + J_2 m)$ memory, where $m = \min(N, J_2)$. It is infeasible to apply
LDA when both $N$ and $J_2$ are large.

LDA is targeted to find a set of weights $w$ and a threshold $\theta$ such that the
discriminant function


$$t(x_i) = w^T x_i - \theta \tag{15.9}$$


maximizes a discrimination criterion such that the between-class variance is
maximized relative to the within-class variance. The between-class variance is
the variance of the class means of $\{t(x_i)\}$, and the within-class variance is the
pooled variance about the means. Figure 15.1a shows why this criterion makes
sense.


Figure 15.1 (a) The line joining the centroids defines the direction of greatest centroid spread, but the
projected data overlap because of the covariance (left). The discriminant direction minimizes this
overlap for Gaussian data (right). (b) The projections of PCA and LDA for a data set.

In the two-class problem, a data vector $x_i$ is assigned to one class if $t(x_i) > 0$
and to the other class if $t(x_i) < 0$. Methods for determining $w$ and $\theta$ include
the perceptron, LDA and regression. The simplicity of the LDA model makes
it a good candidate for classification in situations where training data are very
limited.

An illustration of PCA and LDA for a two-dimensional data set is shown
in Fig. 15.1b. Two Gaussian classes $C_1$ and $C_2$ are represented by two ellipses.
The principal direction obtained from PCA, namely, $w^{\mathrm{PCA}}$, cannot discriminate
the two classes, while $w^{\mathrm{LDA}}$, the principal direction obtained from LDA, can
discriminate the two classes. It is clearly seen that PCA is purely descriptive,
while LDA is discriminative.
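
The contrast between the two directions can be reproduced on synthetic data. The sketch below assumes NumPy; the class means, the shared covariance and the midpoint threshold (reasonable for equal priors) are our own illustrative assumptions. Both classes are elongated along the first axis, so PCA picks that axis, while the class means differ only along the second axis.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.diag([9.0, 0.25])                   # both classes elongated along x1
X1 = rng.multivariate_normal([0.0, 1.0], cov, size=500)    # class C1
X2 = rng.multivariate_normal([0.0, -1.0], cov, size=500)   # class C2
X = np.vstack([X1, X2])

# PCA: principal eigenvector of the covariance of the pooled data.
evals, evecs = np.linalg.eigh(np.cov(X.T))
w_pca = evecs[:, -1]

# Two-class LDA: w is proportional to Sw^{-1} (m1 - m2).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_lda = np.linalg.solve(Sw, m1 - m2)
w_lda = w_lda / np.linalg.norm(w_lda)

def accuracy(w):
    """Classify with t(x) = w^T x - theta; theta is placed halfway between
    the projected class means (an assumption, not prescribed by the text)."""
    if w @ (m1 - m2) < 0:                    # orient w so that C1 projects higher
        w = -w
    theta = w @ (m1 + m2) / 2
    correct = np.sum(X1 @ w - theta > 0) + np.sum(X2 @ w - theta < 0)
    return correct / len(X)

acc_pca, acc_lda = accuracy(w_pca), accuracy(w_lda)
print(w_pca, w_lda, acc_pca, acc_lda)
```

The PCA direction is nearly parallel to the first axis and classifies at about chance level, whereas the LDA direction is nearly parallel to the second axis and separates most of the samples.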


Example 15.1: We compare PCA and LDA by using STPRtool
(http://cmp.felk.cvut.cz/cmp/software/stprtool/). LDA and PCA are trained on
synthetic data generated from a Gaussian mixture model. The LDA and PCA
directions are shown in Fig. 15.2a. The extracted data using LDA and PCA
are shown with the Gaussians fitted by the ML method (see Fig. 15.2b). The
results indicate that LDA effectively separates the samples whereas PCA fails.


Figure 15.2 PCA vs. LDA. (a) The input data and the found LDA and PCA directions. (b) The
class-conditional Gaussians estimated from data projected onto LDA and PCA directions.


For the binary classification problem, LDA has been shown to be equivalent
to regression [25]. This relation is extended to the multiclass case [3]. By using
spectral graph analysis, spectral regression discriminant analysis [3] casts
discriminant analysis into a multiple linear regression framework that facilitates
both efficient computation and the use of regularization techniques. Specifically,
the method only needs to solve a set of regularized LS problems. This allows for
additional constraints (e.g. sparsity) on the LDA solution. It can be computed
with O(ms) time and O(ms) memory, where s (< n) is the average number of
nonzero features in each sample.

Based on a single-layer linear feedforward network, LDA algorithms are also
given in [5, 12]. The $Q^{-1/2}$ network [6] is another neural network based LDA. This
algorithm adaptively computes $Q^{-1/2}$, where $Q$ is the correlation or covariance
matrix. LDA is also extended to regression problems [33].


15.1.1 Solving small sample size problem


Since there are at most $C - 1$ nonzero generalized eigenvalues for the LDA
problem, an upper bound on $J_2$ is $C - 1$. The rank of $S_w$ is at most $N - C$ and
thus, at least $N = J_1 + C$ samples are needed to guarantee $S_w$ to be nonsingular.
This requirement on the number of samples may be severe for some problems
like image processing. Typically, the number of images from each class is
considerably limited in face recognition: only several faces can be acquired from
each person. The dimension of the sample space is typically much larger than the
number of the samples in a training set. For instance, an image of 32-by-32
pixels is represented by a 1,024-dimensional vector, and consequently $S_w$ is
singular and LDA cannot be applied directly. This problem is known as the small
sample size, singularity or undersampled problem.
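
The rank deficiency is easy to observe numerically. The following sketch assumes NumPy and uses random vectors as a stand-in for 32-by-32 images; the numbers of classes and of samples per class are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, per_class, J1 = 5, 4, 32 * 32             # 5 classes, 4 "images" each
N = C * per_class
X = rng.normal(size=(N, J1))                 # random stand-ins for image vectors
y = np.repeat(np.arange(C), per_class)

Sw = np.zeros((J1, J1))
for c in range(C):
    D = X[y == c] - X[y == c].mean(axis=0)   # deviations from the class mean
    Sw += D.T @ D / N

rank_Sw = np.linalg.matrix_rank(Sw)
print(rank_Sw, N - C, J1)
```

Here the within-class scatter matrix is 1,024 by 1,024 but has rank $N - C = 15$, so it cannot be inverted and the classical criterion is not directly applicable.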

Pseudoinverse LDA [55] is based on the pseudoinverse of the scatter matrices.
It applies the eigen-decomposition to the matrix $S_b^{+} S_w$, $S_w^{+} S_b$, or $S_t^{+} S_b$. The
criterion $F_1$ is an extension of the classical one (15.6), with the inverse of a
matrix replaced by the pseudoinverse:


$$\max_{W} F_1(W) = \mathrm{tr}\left(\tilde{S}_t^{+} \tilde{S}_b\right). \tag{15.10}$$

The pseudoinverse-based methods are competitive compared to regularized LDA
[17] and Fisherfaces [59].


Fisherfaces 


PCA is known as the eigenfaces method when it is applied to a large set of
images depicting different human faces. Eigenfaces are a set of standardized face
ingredients, and any human face can be considered to be a combination of these
standard faces. This technique is also used for handwriting analysis, lip
reading, voice recognition, sign language/hand gesture interpretation and medical
imaging analysis. Therefore, some prefer to call this method eigenimages.

A common way to deal with the small-sample-size problem is to apply an 
intermediate dimension-reduction stage, such as PCA, to reduce the dimension 
of the original data before LDA is applied. This method is known as PCA+LDA 
or Fisherfaces [59], [2]. In order to avoid the complication of singular $S_w$, the
Fisherfaces method discards the smallest principal components. The overall per- 
formance of the two-stage approach is sensitive to the reduced dimension in the 
first stage, and the optimal value of the reduced dimension for PCA is difficult 
to determine. The method discards the discriminant information contained in 
the null space of the within-class covariance matrix. 

When the training data set is small, the eigenfaces method outperforms the
Fisherfaces method [41]. It might be because the Fisherfaces method uses all
the principal components, but the components with the small eigenvalues
correspond to high-frequency components and usually encode noise [38]. In line
with this, two enhanced LDA models improve the generalization capability of LDA
by decomposing the LDA procedure into a simultaneous diagonalization of $S_w$ and
$S_b$ [38]. The simultaneous diagonalization is stepwise equivalent to two
operations: whitening $S_w$ and applying PCA on $S_b$ using the transformed data. As
an alternative to the Fisherfaces method, a two-stage LDA [77] avoids computation
of the inverse of $S_w$ by decomposing $S_w^{-1} S_b$.


Example 15.2: We have implemented the eigenfaces method in Example 12.3. 
We now implement the Fisherfaces method for solving the same face recognition 
problem by using the same training and test sets. In this example, we randomly 
select N = 60 samples from C = 30 persons, 2 samples for each person, for train- 
ing. The test set includes 30 samples, one for each person. 

At first, a centered image in vector form is mapped onto an $(N - C)$-dimensional
linear subspace by the PCA weight matrix, and the output is further
projected onto a $(C - 1)$-dimensional linear subspace by the Fisher weight matrix,
so that images of the same class move closer together and images of different
classes move further apart.
After applying the two-step procedure on the training set, the corresponding
weights are called Fisherfaces. When a new sample is presented, its projection on
the weights is compared with those of all the training samples, and the class of
the training sample that has the minimum difference from the test sample is taken
as the decision. On the test samples, the classification rate is 86.67%. It is found
that for this small training set, the classification accuracy of the eigenfaces
method is better than that of the Fisherfaces method.


Regularized LDA 


Another common way to deal with the small-sample-size problem is to add some
constant value $\mu > 0$ to the diagonal elements of $S_w$, as $S_w + \mu I$, where $I$ is an
identity matrix [17]. $S_w + \mu I$ is positive-definite, hence nonsingular. This
approach is called regularized LDA. Regularization reduces the high variance
related to the eigenvalue estimates of $S_w$, at the expense of potentially increased
bias. The optimal value of the regularization parameter $\mu$ is difficult to
determine, and crossvalidation is commonly applied for estimating the optimal $\mu$.
By adjusting $\mu$, a set of LDA variants are obtained, such as DLDA [40] for $\mu = 1$.
The tradeoff between variance and bias, depending on the severity of the
small-sample-size problem, is controlled by the strength of regularization.
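
The effect of the diagonal loading can be illustrated with a two-class toy problem in which the dimension exceeds the number of samples. The sketch assumes NumPy; the data, the dimension and the value of the regularization constant (`mu_reg` below) are illustrative assumptions, and no claim is made about an optimal choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, J1, mu_reg = 5, 50, 0.1         # far fewer samples than dimensions
X1 = rng.normal(loc=0.5, size=(n_per_class, J1))
X2 = rng.normal(loc=-0.5, size=(n_per_class, J1))
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
N = 2 * n_per_class

Sw = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / N
min_eig_plain = np.linalg.eigvalsh(Sw).min()         # numerically zero: Sw is singular

Sw_reg = Sw + mu_reg * np.eye(J1)                    # regularized within-class scatter
min_eig_reg = np.linalg.eigvalsh(Sw_reg).min()       # shifted up by mu_reg

# With the regularized matrix, the two-class LDA direction is well defined.
w = np.linalg.solve(Sw_reg, m1 - m2)
scores1, scores2 = X1 @ w, X2 @ w
print(min_eig_plain, min_eig_reg, scores1.min(), scores2.max())
```

Every eigenvalue of the regularized matrix is at least the added constant, so the linear system has a unique solution. In practice, the constant would be selected by crossvalidation, as noted above.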

In regularized LDA, an appropriate regularization parameter is selected from 
a given parameter candidate set by using cross-validation for classification. In 
regularized orthogonal LDA, the regularization parameter is selected by using a 
mathematical criterion [9]. 

Quadratic discriminant analysis models the likelihood of each class as a Gaus- 
sian distribution, then uses the posterior distributions to estimate the class for 
a given test point [25]. The Gaussian parameters can be estimated from train- 
ing points with ML estimation. Unfortunately, when the number N of training 
samples is small compared to the dimension d of each training sample, the ML 
covariance estimation can be ill-posed. 

Regularized quadratic discriminant analysis [17] shrinks the covariances of 
quadratic discriminant analysis toward a common covariance to achieve a 
compromise between LDA and quadratic discriminant analysis. Regularized 
quadratic discriminant analysis performs well when the true Gaussian distri- 
bution matches one of their regularization covariance models (e.g., diagonal, 
identity), but can fail when the generating distribution has a full covariance 
matrix, particularly when features are correlated. The single-parameter regu- 
larized quadratic discriminant analysis algorithm [8] reduces the computational 
complexity from O(N?) to O(N). Bayesian quadratic discriminant analysis [56] 
performs similarly to ML quadratic discriminant analysis in terms of error rates. 
Its performance is very sensitive to the choice of the prior. 
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In penalized discriminant analysis [24], a symmetric and positive-semidefinite 
penalty matrix A is added to Sw. Flexible discriminant analysis [23] extends LDA 
to the nonlinear and multiclass classification via a penalized regression setting. 
To do that, it reformulates the discriminant analysis problem as a regression one 
and, then, uses a nonlinear function to fit the data. It is based on encoding the 
class labels into response scores and then a nonparametric regression technique, 
such as neural networks, is used to fit the response scores. 

Null-space LDA [7] attempts to solve the small-sample-size problems directly. 
The null space of $\boldsymbol{S}_w$ contains useful discriminant information. The method first
projects the data onto the null space of $\boldsymbol{S}_w$, and then applies PCA to
maximize $\boldsymbol{S}_b$ in the transformed space. Based on the eigen-decomposition of the
original scatter matrices, null-space LDA may ignore some useful information
by considering the null space of $\boldsymbol{S}_w$ only. The discriminative common vector
method [4] addresses computational difficulties encountered in null-space LDA. 
In a fast implementation for null-space based LDA [10], the optimal transforma- 
tion matrix is obtained by orthogonal transformations by using QR factorization 
and QR factorization with column pivoting of the data matrix. 

Gradient LDA [53] is based on the gradient-descent method, yet its convergence
is fast and reliable. It does not discard any null spaces of the $\boldsymbol{S}_w$ and $\boldsymbol{S}_b$ matrices
and thus preserves discriminative information which is useful for classification. 

Weighted piecewise LDA [34] first creates subsets of features and applies LDA 
to each subset. It then combines the resulting piecewise linear discriminants to 
produce an overall solution. Initially, a set of weighted piecewise discriminant 
hyperplanes are used in order to provide a more accurate discriminant decision 
than the one produced by LDA. 

Linear discriminants are computed by minimizing the regularization functional 
[50]. The common regularization technique [17] for resolving the singularity prob- 
lem is well justified in the framework of statistical learning theory. The resulting 
discriminants capture both regular and irregular information, where regular
discriminants reside in the range space of $\boldsymbol{S}_w$, while irregular discriminants reside
in the null space of $\boldsymbol{S}_w$. Linear discriminants are computed by regularized LS
regression. The method and its nonlinear extension belong to the same frame- 
work where SVMs are formulated. 


Uncorrelated LDA and orthogonal LDA 


Uncorrelated LDA [30] computes the optimal discriminant vectors that are
$\boldsymbol{S}_t$-orthogonal. It extracts features that are uncorrelated in the dimension-reduced
space. It overcomes the small-sample-size problem by optimizing a generalized 
Fisher criterion. If there are a large number of samples in each class, uncorrelated 
LDA may overfit noise in the data. 
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The solution to uncorrelated LDA can be found by optimizing [67]: 
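$$ \boldsymbol{W} = \arg\max_{\boldsymbol{W}} \operatorname{tr}\left( (\boldsymbol{W}^T \boldsymbol{S}_t \boldsymbol{W})^{\dagger} \boldsymbol{W}^T \boldsymbol{S}_b \boldsymbol{W} \right), \qquad (15.11) $$

where the superscript $\dagger$ denotes the pseudoinverse. Specifically, suppose that $r$ vectors
$\boldsymbol{w}_1, \boldsymbol{w}_2, \ldots, \boldsymbol{w}_r$ are obtained; then the $(r+1)$-th vector $\boldsymbol{w}_{r+1}$ of uncorrelated
LDA is the one that maximizes the Fisher criterion function $E_{\mathrm{LDA},2}(\boldsymbol{w})$ given
by (15.7) subject to the constraints

$$ \boldsymbol{w}_{r+1}^T \boldsymbol{S}_t \boldsymbol{w}_i = 0, \quad i = 1, \ldots, r. \qquad (15.12) $$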


The algorithm given in [30] finds $\boldsymbol{w}_i$ successively. 

The uncorrelated LDA transformation maps all data points from the same 
class to a common vector. Both regularized LDA and PCA+LDA are regular- 
ized versions of uncorrelated LDA. A unified framework for generalized LDA 
[29] elucidates the properties of various algorithms and their relationships via a 
transfer function. 

To overcome the rank limitation of LDA, the Foley-Sammon optimal discrimi- 
nant vectors method [16] aims to find an optimal set of orthonormal discriminant 
vectors that maximize the Fisher discriminant criterion under the orthogonal 
constraint. The Foley-Sammon method outperforms classical LDA in the sense 
that it can obtain more discriminant vectors for recognition, but its solution 
is more complicated than other LDA methods. The multiclass Foley-Sammon 
method can only extract the linear features of the input patterns, and the algo- 
rithm can be based on subspace decomposition [48] or be an analytic method 
based on Lagrange multipliers [14]. The Foley-Sammon method does not show 
good performance when having to deal with nonlinear patterns, such as face 
patterns. Both uncorrelated LDA and Foley-Sammon LDA use the same Fisher 
criterion function, and the main difference is that the optimal discriminant
vectors generated by uncorrelated LDA are $\boldsymbol{S}_t$-orthogonal to one another, while
the optimal discriminant vectors of Foley-Sammon LDA are orthogonal to one 
another. 

An uncorrelated optimal discrimination vectors method [30] uses the constraint 
of statistical uncorrelation. Orthogonal LDA [14] enforces W in Fisher’s criterion 
(15.8) to be orthogonal: $\boldsymbol{W}^T \boldsymbol{W} = \boldsymbol{I}$. Orthogonal LDA [67] provides a simple and 
efficient way for computing orthogonal transformations in the framework of LDA. 
The discriminant vectors of orthogonal LDA are orthogonal to one another, i.e., 
the transformation matrix of orthogonal LDA is orthogonal. Orthogonal LDA 
often leads to better performance than uncorrelated LDA in classification. The 
features in the reduced space of uncorrelated LDA [30] are uncorrelated, while 
the discriminant vectors of orthogonal LDA [67] are orthogonal to one another. 
Geometrically, both uncorrelated LDA and orthogonal LDA project the data 
onto the subspace spanned by the centroids. Uncorrelated LDA may be sensitive 
to the noise in the data. Regularized orthogonal LDA is proposed in [70]. 
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The approach of common vectors extracts the common properties of classes in 
the training set by eliminating the differences of the samples in each class [21]. A 
common vector for each individual class is obtained by removing all the features 
that are in the direction of the eigenvectors corresponding to the nonzero eigen- 
values of the scatter matrix of its own class. The discriminative common vectors 
are obtained from the common vectors. Every sample in a given class produces
the same unique common vector when it is projected onto the null space
of $\boldsymbol{S}_w$. The optimal projection vectors are found by using the common vectors
and the discriminative common vectors are determined by projecting any sample 
from each class onto the span of optimal projection vectors. The discriminative 
common vector method [4] finds optimal orthonormal projection vectors in the 
optimal discriminant subspace. It is equivalent to the null space method, but 
omitting the dimension-reduction step; therefore the method exploits the orig- 
inal high-dimensional space. It combines kernel-based methodologies with the 
optimal discriminant subspace concept. 

Compared with Fisherfaces, exponential discriminant analysis [71] can extract 
the most discriminant information that is contained in the null space of $\boldsymbol{S}_w$.
Compared with null-space LDA, the discriminant information that is contained
in the non-null space of $\boldsymbol{S}_w$ is not discarded. Exponential discriminant
analysis is equivalent to transforming the original data into a new space by distance
diffusion mapping, and LDA is then applied in such a new space. Diffusion map- 
ping enlarges the margin between different classes, improving the classification 
accuracy. 


LDA/GSVD and LDA/QR 


A generalization of LDA by using generalized SVD (LDA/GSVD) [28], [64] can 
be used to solve the problem of singularity of Sw. LDA/GSVD has numerical 
advantages over the two-stage approach, and is a special case of pseudoinverse 
LDA, where the pseudoinverse is applied to $\boldsymbol{S}_t$. It avoids the inversion of $\boldsymbol{S}_w$
by applying generalized SVD. The nonsingularity of $\boldsymbol{S}_w$ is not required, and it
solves the eigen-decomposition of $\boldsymbol{S}_t^{\dagger} \boldsymbol{S}_b$ [64]. The solution to LDA/GSVD can be
obtained by computing the eigen-decomposition on the matrix $\boldsymbol{S}_t^{\dagger} \boldsymbol{S}_b$. LDA/GSVD
computes the solution exactly without losing any information, but with high 
computational cost. 
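The criterion $F_0$ used in [64] is

$$ F_0(\boldsymbol{W}) = \operatorname{tr}\left[ (\boldsymbol{W}^T \boldsymbol{S}_b \boldsymbol{W})^{\dagger} \, \boldsymbol{W}^T \boldsymbol{S}_w \boldsymbol{W} \right]. \qquad (15.13) $$

LDA/GSVD aims to find the optimal $\boldsymbol{W}$ that minimizes $F_0(\boldsymbol{W})$, subject to the
constraint that $\operatorname{rank}(\boldsymbol{W}^T \boldsymbol{H}_b) = q$, where $q$ is the rank of $\boldsymbol{S}_b$.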

LDA/QR [68] is also a special case of pseudoinverse LDA, where the
pseudoinverse is applied to $\boldsymbol{S}_b$ instead. It is a two-stage LDA extension. The first
stage maximizes the separation between different classes by applying QR
decomposition to a small-size matrix. The distinct property of this stage is its low
time/space complexity. The second stage incorporates both between-class and 
within-class information by applying LDA to the reduced scatter matrices result- 
ing from the first stage. The computational complexity of LDA/QR is O(Nd) for 
N training examples of d dimensions. LDA/QR scales to large data sets since it 
does not require the entire data in main memory. Both LDA/QR and Fisherfaces 
are approximations of LDA/GSVD, but LDA/QR is much more efficient than 
PCA+LDA. 


Incremental LDA 


Examples of incremental LDA (ILDA) algorithms are neural network based 
LDA [43], IDR/QR [66], and GSVD-ILDA [74]. Iterative algorithms for neural
network-based LDA [6], [43] require $O(d^2)$ time for one-step update, where $d$
is the dimension of the data. 

IDR/QR [66] applies QR decomposition at the first stage to maximize the sep- 
arability between different classes. The second stage incorporates both between- 
class and within-class information by applying LDA on the reduced scatter matri- 
ces resulting from the first stage. IDR/QR does not require that the whole data 
matrix be in main memory, which allows it to scale to very large data sets. The 
classification error rate achieved by IDR/QR is very close to the best possible 
one achieved by other LDA-based algorithms. The computational complexity 
of IDR/QR is O(NdK) for N training examples, K classes and d dimensions. 
IDR/QR can be an order of magnitude faster than SVD or generalized SVD- 
based LDA algorithms. Based on LDA/GSVD, GSVD-ILDA [74] determines the 
projection matrix in full space. GSVD-ILDA can incrementally learn an adap- 
tive subspace instead of recomputing LDA/GSVD. It gives the same performance 
as LDA/GSVD but with much smaller computational complexity. GSVD-ILDA 
yields a better classification performance than the other incremental LDA algo- 
rithms [74]. 

In an incremental LDA algorithm [49], $\boldsymbol{S}_b$ and $\boldsymbol{S}_w$ are incrementally updated, 
and then the eigenaxes of a feature space are obtained by solving an eigenprob- 
lem. An incremental implementation of the MMC method can be found in [63]. In 
[75], the proposed algorithm for solving generalized discriminant analysis applies 
QR decomposition rather than SVD. It incrementally updates the discriminant 
vectors when new classes are inserted into the training set. 


Other discriminant methods 

Neighborhood component analysis [19] is a nonparametric learning method that
handles the tasks of distance learning and dimension reduction. It maximizes the
between-class separability by maximizing a stochastic variant of the leave-one- 
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out k-NN score on the training set. A Mahalanobis distance measure is learned 
for k-NN classification. Nearest neighbor discriminant analysis [51] is another 
nonparametric linear feature extraction method, proposed from the view of the 
nearest neighbor classification. 

Subclass discriminant analysis [76] uses a single formulation for most distri- 
bution types by approximating the underlying distribution of each class with 
a mixture of Gaussians. The method resolves the problem of multimodally dis- 
tributed classes. It is easy to use generalized EVD to find those discriminant 
vectors that best linearly classify the data. The major problem is to deter- 
mine the optimal number of Gaussians per class, i.e., the number of subclasses. 
In the reported comparisons, the method is the best or comparable to the best
among LDA, direct LDA, heteroscedastic LDA, nonparametric discriminant analysis
and kernel-based LDA. For data with Gaussian homoscedastic subclass
structure, subclass discriminant analysis is not guaranteed to provide the
discriminant subspace that minimizes the Bayes error. Mixture subclass discriminant
analysis [18] alleviates this shortcoming by modifying the objective function of 
subclass discriminant analysis and utilizes a partitioning procedure to aid dis- 
crimination of data with Gaussian homoscedastic subclass structure. 

The performance of LDA-based methods degrades when the actual distribu- 
tion is non-Gaussian. To address this problem, a formulation of scatter matrices 
extends the two-class nonparametric discriminant analysis to multiclass cases. 
Multiclass nonparametric subspace analysis [37] has two complementary methods 
that are based on the principal space and the null space of the intraclass scatter 
matrix, respectively. Corresponding multiclass nonparametric feature analysis 
methods are derived as enhanced versions of their nonparametric subspace anal- 
ysis counterparts. In another extension of LDA to multi-class [39], the approxi- 
mate pairwise accuracy criteria, which weight the contribution of individual class 
pairs in terms of Bayes error, replace Fisher’s criterion. 

A Bayes-optimal LDA algorithm [22] provides the one-dimensional subspace, 
where the Bayes error is minimized for the C-class problem with homoscedastic 
Gaussian distributions by using standard convex optimization. The algorithm is 
then extended to the minimization of Bayes error in the more general case of 
heteroscedastic distributions by means of an appropriate kernel mapping func- 
tion. 

Recursive LDA [61] determines the discriminant direction for separating dif- 
ferent classes by maximizing the generalized Rayleigh quotient, and generates a 
new sample set by projecting the samples into a subspace that is orthogonal to 
this discriminant direction. The second step is repeated. The kth discriminating 
vector extracted can be interpreted as the kth best direction for separation by 
the nature of the optimization process involved. The recursive process naturally 
stops when the between-class scatter is zero. The total number of discriminating 
vectors from recursive LDA is independent of the number of classes $C$, while that
of LDA is limited to $C - 1$. The discriminating vectors found may not form
a complete basis even of a finite-dimensional feature space.
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Linear boundary discriminant analysis [45] increases class separability by 
reflecting the different significances of non-boundary and boundary patterns. 
This is achieved by defining two scatter matrices and solving the eigenproblem 
on the criterion described by these scatter matrices. The possible number of 
features obtained is larger than that by LDA, and it brings better classification 
performance. To distinguish the boundary patterns from the non-boundary
patterns, a relevant pattern selection technique can be employed, which selects
boundary patterns according to a proximity measure.

Rotational LDA [54] applies an additional rotational transform prior to 
dimension-reduction transformation. The rotational transform rotates the fea- 
ture vectors in the original feature space around their respective class centroids 
in such a way that the overlap between the classes in the reduced feature space 
is further minimized. 

The IDA technique [47] is based on a numerical optimization of an information- 
theoretic objective function, which can be computed analytically. If the classes 
conform to the homoscedastic Gaussian conditions, IDA reduces to LDA and 
is an optimal feature extraction technique in the sense of Bayes. When class- 
conditional pdfs are highly overlapped, IDA outperforms other second-order tech- 
niques. In [72], the LDA method is proposed by maximizing a non-parametric 
estimate of the mutual information between linearly transformed input data and 
the class labels. The method can produce linear transformations that can signif- 
icantly boost class-separability, especially for nonlinear classification. 

Maximum margin criterion (MMC) [35] is applied to dimension reduction. 
The optimal transformation is computed by maximizing the sum of all interclass 
distances. MMC does not involve the inversion of scatter matrices and thus, 
avoids the small-sample-size problem implicitly. The MMC is defined as [35] 


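$$ J_{\mathrm{MMC}}(\boldsymbol{W}) = \operatorname{tr}\left( \boldsymbol{W}^T (\boldsymbol{S}_b - \boldsymbol{S}_w) \boldsymbol{W} \right). \qquad (15.14) $$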


The projection matrix $\boldsymbol{W}$ can be found as the eigenvectors of $\boldsymbol{S}_b - \boldsymbol{S}_w$
corresponding to the largest eigenvalues. MMC is not equivalent to the Fisher
criterion. The discriminant vectors using the two criteria are different. The MMC
method is an efficient algorithm to compute the projection matrix of MMC under
the constraint that $\boldsymbol{W}^T \boldsymbol{S}_t \boldsymbol{W} = \boldsymbol{I}$. It is found to be the same as uncorrelated LDA
[67]. 
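
A minimal sketch of the unconstrained MMC projection in (15.14) is given below for two-dimensional data. It assumes unnormalized scatter matrices (between-class scatter weighted by class size) and a hypothetical toy data set whose within-class scatter is singular; the closed-form eigen-solver for symmetric 2x2 matrices is an illustrative helper, not part of the method in [35].

```python
import math


def scatter_matrices(classes):
    """Return (S_b, S_w) for 2-D data, using unnormalized scatter sums."""
    all_points = [p for points in classes for p in points]
    n_all = len(all_points)
    m = [sum(p[k] for p in all_points) / n_all for k in range(2)]
    s_b = [[0.0, 0.0], [0.0, 0.0]]
    s_w = [[0.0, 0.0], [0.0, 0.0]]
    for points in classes:
        n_c = len(points)
        m_c = [sum(p[k] for p in points) / n_c for k in range(2)]
        for a in range(2):
            for b in range(2):
                s_b[a][b] += n_c * (m_c[a] - m[a]) * (m_c[b] - m[b])
                s_w[a][b] += sum((p[a] - m_c[a]) * (p[b] - m_c[b])
                                 for p in points)
    return s_b, s_w


def top_eigenpair(sym):
    """Largest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    a, b, d = sym[0][0], sym[0][1], sym[1][1]
    lam = 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)
    if b != 0.0:
        v = [b, lam - a]
    else:
        v = [1.0, 0.0] if a >= d else [0.0, 1.0]
    norm = math.hypot(v[0], v[1])
    return lam, [v[0] / norm, v[1] / norm]


# Hypothetical toy data; S_w is singular here, yet no inversion is needed.
classes = [[(0.0, 0.0), (2.0, 0.0)], [(1.0, 3.0), (3.0, 3.0)]]
s_b, s_w = scatter_matrices(classes)
diff = [[s_b[a][b] - s_w[a][b] for b in range(2)] for a in range(2)]
j_mmc, w = top_eigenpair(diff)

print(j_mmc)  # value of the criterion for the one-dimensional projection w
print(w)      # leading eigenvector of S_b - S_w
```

For a one-dimensional projection with a unit-norm vector, the criterion value equals the largest eigenvalue of the difference matrix, which is what `top_eigenpair` returns.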

Maximum margin projection [60] aims to project data samples into the most 
discriminative subspace, where clusters are most well-separated. It projects input 
patterns onto the normal of the maximum margin separating hyperplanes. As a 
result, the method only depends on the geometry of the optimal decision bound- 
ary. The problem is a nonconvex one, which can be decomposed into a series of 
convex subproblems using the constrained concave-convex procedure (CCCP). 
The computation time is linear in the size of data set. Maximum margin projec- 
tion extracts a subspace more suitable for discrimination than geometry-based 
methods. 
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Figure 15.3 Classification of the iris data set using discriminant analysis. (a) x2 vs. £a. (b) £1 vs. £4. 




Example 15.3: Discriminant analysis fits a parametric model to the training
data and interpolates to classify new data. Linear discrimination fits a
multivariate normal density to each group, with a pooled estimate of covariance.
Quadratic discrimination fits multivariate normal densities with covariance
estimates stratified by group. Both methods use likelihood ratios to assign
observations to groups. By applying quadratic discriminant analysis on Fisher’s iris
data set, the classification between class 2 and class 3 is shown in Fig. 15.3. 
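
The difference between the pooled and the stratified covariance estimates can be sketched as follows. This is not the iris experiment of Fig. 15.3: it uses a small hypothetical two-dimensional data set, ML covariance estimates, and equal class priors, all of which are assumptions made for illustration.

```python
import math


def mean_cov(points):
    """ML estimates of the mean and covariance of 2-D points."""
    n = len(points)
    m = [sum(p[k] for p in points) / n for k in range(2)]
    c = [[sum((p[a] - m[a]) * (p[b] - m[b]) for p in points) / n
          for b in range(2)] for a in range(2)]
    return m, c


def log_gauss(x, m, c):
    """Log-density of a 2-D Gaussian with mean m and covariance c."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det],
           [-c[1][0] / det, c[0][0] / det]]
    d = [x[0] - m[0], x[1] - m[1]]
    maha = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    return -0.5 * (maha + math.log(det)) - math.log(2.0 * math.pi)


def classify(x, models):
    """Pick the label with the largest log-likelihood (equal priors assumed)."""
    return max(models, key=lambda label: log_gauss(x, *models[label]))


# Hypothetical training data: class "A" is tight, class "B" is spread out.
data = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "B": [(4.0, 4.0), (6.0, 4.0), (4.0, 6.0), (6.0, 6.0)],
}
fits = {label: mean_cov(points) for label, points in data.items()}

# Quadratic discrimination: one covariance estimate per class.
qda = fits

# Linear discrimination: a single pooled covariance shared by all classes.
n_total = sum(len(points) for points in data.values())
pooled = [[sum(len(data[k]) * fits[k][1][a][b] for k in data) / n_total
           for b in range(2)] for a in range(2)]
lda = {label: (fits[label][0], pooled) for label in data}

x_new = (2.5, 2.5)
print(classify(x_new, qda), classify(x_new, lda))
```

For the test point chosen here the two rules disagree: the quadratic rule assigns it to the widely spread class B, whereas the linear rule, which ignores the difference in spread, assigns it to the nearer class mean A.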


Nonlinear discriminant analysis 


The nonlinear discriminant analysis network proposed in [11] uses the MLP archi- 
tecture and Fisher’s determinant ratio criterion. After an MLP-like nonlinear 
mapping of input vectors, the eigenvector-based linear map of Fisher’s analysis 
is applied to the last hidden-layer outputs. When compared with MLP, the non- 
linear discriminant analysis network can provide better results in imbalanced- 
class problems. For these problems, MLP tends to underemphasize the small 
class samples, while the target-free training of nonlinear discriminant analysis 
gives more balanced classifiers. The results of natural-gradient training for the
nonlinear discriminant analysis network [20] are comparable with those obtained
with CG training, although CG has reduced complexity.

A layered lateral network-based LDA network and an MLP-based nonlinear 
discriminant analysis network are proposed in [43]. The two-layer LDA network 
determines the LDA projection matrix, where each layer is a Rubner-Tavan PCA 
network. This algorithm performs a simultaneous diagonalization of two
matrices: $\boldsymbol{S}_w$ and $\boldsymbol{S}_b$. The network output gives the eigenvectors of $\boldsymbol{S}_w^{-1} \boldsymbol{S}_b$. This method
has slow convergence, particularly when the input dimension is high [43]. 

Generalized discriminant analysis [1] extends LDA from linear domain to a 
nonlinear domain via the kernel trick. It aims to solve a generalized eigenvalue 
problem, which is always implemented by SVD. Semi-supervised generalized dis- 
criminant analysis [73] utilizes unlabeled data to maximize an optimality cri- 
terion of generalized discriminant analysis and formulates the problem as an 
optimization problem that is solved using CCCP. Kernel subclass discriminant 
analysis [76] can resolve the problem of nonlinearly separable classes by using the 
kernel between-subclass scatter matrix. Many kernel-based nonlinear discrimi- 
nant analysis methods are expounded in Section 17.3. 

Locally linear discriminant analysis [31] is an approach to nonlinear discrim- 
inant analysis that involves a set of locally linear transformations. Input vec- 
tors are projected into multiple local feature spaces by the corresponding linear 
transformations to yield classes that maximize the between-class covariance while 
minimizing the within-class covariance. For nonlinear multiclass discrimination, 
the method is computationally highly efficient compared to generalized discrim- 
inant analysis [1]. The method does not suffer from overfitting due to the linear 
base structure of the solution. 

Semi-supervised local LDA [58] preserves the global structure of unlabeled 
samples in addition to separating labeled samples in different classes from one 
another. It has an analytic form of the globally optimal solution and can be 
computed based on eigen-decomposition. 

LDA tends to give undesired results if samples in a class form several separate 
clusters (multimodal) or there are outliers. Locality-preserving projection [26] is 
an unsupervised dimension-reduction method that works well with multimodal 
labeled data due to its locality-preserving property. It seeks a transformation 
matrix such that nearby data pairs in the original space are kept close in the 
embedding space. The idea of Laplacian score is to evaluate each feature by its 
locality-preserving power, showing similarity in spirit to locality-preserving pro- 
jection [27]. Local LDA [57] effectively combines the ideas of LDA and locality- 
preserving projection, that is, local LDA maximizes between-class separability 
and preserves within-class local structure, and thus works well even when within- 
class multimodality or outliers exist. The solution can be easily computed just 
by solving a generalized eigenvalue problem, thus resulting in high scalability in 
data visualization and classification tasks. 


Two-dimensional discriminant analysis 


Some feature extraction methods have been developed by representing images
directly as matrices. Two-dimensional PCA and generalized low-rank
approximations of matrices [69] reduce the reconstruction error without considering the
classification.

Two-dimensional LDA algorithms [65, 36, 32, 62] provide an efficient approach 
to image feature extraction and can overcome the small-sample-size problem. 
Based on the image matrix, two-dimensional LDA aims to minimize the within-class
distances and maximize the between-class distances. It fails when a class does
not follow a single Gaussian distribution or when the class centers overlap.

Two-dimensional nearest-neighbor discriminant analysis [52] extracts features 
to improve the performance of nearest-neighbor classification. The method can 
be regarded as a two-dimensional extension of nearest-neighbor discriminant 
analysis [51] with matrix-based image representation. 

Two-directional two-dimensional LDA [46] is proposed for object/face image
representation and recognition to address the massive memory requirements
of the two-dimensional LDA method. It has the advantages of a higher
recognition rate, lower memory requirements and better computing performance
than the standard PCA, 2D-PCA and two-dimensional LDA methods.


Problems 


15.1 Consider four points in two classes: class 1: (2, 4), (1, -4); class 2: (-2, 3),
(-4, 2). Compute the scatter matrices for LDA.


15.2 Show how to transform the generalized eigenvalue problem 
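$$ \max_{\boldsymbol{x}} \; \boldsymbol{x}^T \boldsymbol{S}_b \boldsymbol{x} \quad \text{subject to} \quad \boldsymbol{x}^T \boldsymbol{S}_w \boldsymbol{x} = 1 $$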
into a standard eigenvalue problem. 


15.3 Consider a data set {(1, 1, 1), (1, 2,1), (1.5, 1,1), (1,3, 1), (4, 4, 2), (3, 5, 2), (5, 4, 2), (6, 4, 2)}, 
where each pattern consists of an x-coordinate, a y-coordinate and a class label. 
Find the projection directions associated with LDA. 
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Support vector machines 


Introduction 


SVM [14], [209] is one of the most popular nonparametric classification algo- 
rithms. It is optimal and is based on computational learning theory [208, 210]. 
The goal of SVM is to minimize the VC dimension by finding the optimal hyper- 
plane between classes, with the maximal margin, where the margin is defined as 
the distance of the closest point in each class to the separating hyperplane. It has 
a general-purpose linear learning algorithm and a problem-specific kernel that 
computes the inner product of input data points in a feature space. The key idea
of SVM is to map the training set from the input space into a
higher-dimensional feature space by means of a set of nonlinear kernel functions, so that
the images of the training examples become linearly, or almost linearly, separable
in the feature space. The hippocampus, a brain region critical for learning and memory
processes, has been reported to possess pattern separation function similar to 
SVM [7]. 

SVM is a three-layer feedforward network. It implements the structural risk- 
minimization (SRM) principle that minimizes the upper bound of the general- 
ization error. This induction principle is based on the fact that the generaliza- 
tion error is bounded by the sum of a training error and a confidence-interval 
term that depends on the VC dimension. Generalization errors of SVMs are not 
related to the input dimensionality, but to the margin with which they separate
the data. Instead of minimizing the training error, SVM aims to minimize
an upper bound of the generalization error and maximizes the margin between 
a separating hyperplane and the training data. 

SVM is a universal approximator for various kernels [76]. It is popular for 
classification, regression and clustering. One of the main features of SVM is the 
absence of local minima. SVM is defined in terms of a subset of the learning 
data, called support vectors. It is a sparse representation of the training data, 
and allows the extraction of a condensed data set based on the support vectors. 

Kernel methods are also known as kernel machines. A kernel function
$k(\boldsymbol{x}, \boldsymbol{x}')$ is a transformation function that satisfies Mercer’s theorem. A
Mercer kernel, i.e., a continuous, symmetric and positive definite function, requires
the kernel matrix to be positive semidefinite; that is, it has only nonnegative
eigenvalues. A kernel can be expressed as an inner-product operation in some
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Figure 16.1 Kernel-based transformation: from input space to feature space. Linear separation is 
produced in the feature space. 


high-dimensional feature space: 
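$$ k(\boldsymbol{x}, \boldsymbol{x}') = \langle \boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{x}') \rangle, \qquad (16.1) $$

where $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$, i.e., $\boldsymbol{\phi}(\boldsymbol{x})$ is the image of $\boldsymbol{x}$ from the input space $\mathcal{X}$ to the space $\mathcal{F}$.
Examples of Mercer kernels are the Gaussian kernel

$$ k(\boldsymbol{x}, \boldsymbol{x}') = e^{-\|\boldsymbol{x} - \boldsymbol{x}'\|^2 / \sigma^2} \qquad (16.2) $$

and the polynomial kernel

$$ k(\boldsymbol{x}, \boldsymbol{x}') = (1 + \langle \boldsymbol{x}, \boldsymbol{x}' \rangle)^d, \qquad (16.3) $$

where $\sigma > 0$ and $d$ is a positive integer.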

According to Mercer’s work [139], a nonnegative linear combination of Mercer 
kernels is also a Mercer kernel, and the product of Mercer kernels is also a 
Mercer kernel. The performance of every kernel-based method depends on the 
kernel type selected. However, there are no general theories for choosing a kernel 
in a data-dependent way. 

Nonlinear kernel functions are used to overcome the curse of dimensionality. 
The space of the input examples $\mathbb{R}^n$ is mapped onto a high-dimensional feature
space so that the optimal separating hyperplane built on this space allows a good
generalization capacity. By choosing an adequate mapping, the input examples
become linearly or almost linearly separable in the high-dimensional space. This
mapping transforms nonlinearly separable data points in the input space into linearly
separable ones in the resulting high-dimensional space (see Fig. 16.1).

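Let the kernel matrix be $\boldsymbol{K} = [k(\boldsymbol{x}_i, \boldsymbol{x}_j)]_{n \times n}$. If for all the $n$ data points and
any vector $\boldsymbol{v} \in \mathbb{R}^n$ the inequality $\boldsymbol{v}^T \boldsymbol{K} \boldsymbol{v} \geq 0$ holds, then $k(\cdot)$ is said to be positive
definite. If this is only satisfied for those $\boldsymbol{v}$ with $\boldsymbol{1}_n^T \boldsymbol{v} = 0$, then $k(\cdot)$ is said to be
conditionally positive definite. A kernel is indefinite, if for some $\boldsymbol{K}$ there exist
vectors $\boldsymbol{v}$ and $\boldsymbol{v}'$ with $\boldsymbol{v}^T \boldsymbol{K} \boldsymbol{v} > 0$ and $\boldsymbol{v}'^T \boldsymbol{K} \boldsymbol{v}' < 0$.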
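
The positive-definiteness condition can be spot-checked numerically. The sketch below builds the kernel matrix of the Gaussian kernel (16.2) for a hypothetical set of four points and evaluates the quadratic form for randomly drawn vectors. A random test of this kind can only fail to find a violation; it does not prove positive definiteness.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma^2), as in (16.2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / sigma ** 2)


def kernel_matrix(points, kernel):
    """Kernel matrix K with entries K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(k, v):
    """Evaluate v^T K v."""
    n = len(v)
    return sum(v[i] * k[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical sample of n = 4 points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K = kernel_matrix(points, gaussian_kernel)

rng = random.Random(0)  # fixed seed so that the check is reproducible
values = [quadratic_form(K, [rng.uniform(-1.0, 1.0) for _ in points])
          for _ in range(1000)]
print(min(values) >= 0.0)
```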

The squared Euclidean distance has been generalized into a high-dimensional 
space F via the kernel trick 


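$$ \|\boldsymbol{\phi}(\boldsymbol{x}) - \boldsymbol{\phi}(\boldsymbol{y})\|^2 = k(\boldsymbol{x}, \boldsymbol{x}) + k(\boldsymbol{y}, \boldsymbol{y}) - 2k(\boldsymbol{x}, \boldsymbol{y}). \qquad (16.4) $$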


This generalization becomes possible provided that the kernel is conditionally 
positive definite. 
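
As a small check of the kernel trick in (16.4), the sketch below compares the distance computed from kernel values alone with the distance computed in an explicit feature space. It assumes scalar inputs and the degree-2 polynomial kernel, for which an explicit three-dimensional feature map is easy to write down; the function names are illustrative.

```python
import math


def poly_kernel(x, y):
    """Polynomial kernel (1 + x*y)^2 for scalar inputs."""
    return (1.0 + x * y) ** 2


def feature_map(x):
    """Explicit map with feature_map(x) . feature_map(y) == poly_kernel(x, y)."""
    return (1.0, math.sqrt(2.0) * x, x * x)


def kernel_sq_distance(x, y, kernel):
    """Squared feature-space distance from kernel values only, as in (16.4)."""
    return kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)


x, y = 0.5, -2.0
explicit = sum((a - b) ** 2 for a, b in zip(feature_map(x), feature_map(y)))
implicit = kernel_sq_distance(x, y, poly_kernel)
print(explicit, implicit)
```

Both numbers agree up to rounding, although `kernel_sq_distance` never forms the feature vectors.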
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Under mild assumptions, the solution of support vector regression (SVR) can 
be written as a linear combination of kernel functions. This is known as the 
representer theorem. 


Theorem 16.1 (Representer theorem). A mapping f can be written as a 
linear combination of kernel functions 


$$ f(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i k(\boldsymbol{x}_i, \boldsymbol{x}), \qquad (16.5) $$

where $\alpha_i \in \mathbb{R}$ are suitable coefficients.


From (16.5) only samples $\boldsymbol{x}_i$ with $\alpha_i \neq 0$ have influence on $f(\boldsymbol{x})$; such samples 
are called support vectors. If k(-) is a polynomial kernel it is easy to see that the 
representation (16.5) is not unique whenever the sample size N is too large. 
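
The role of the support vectors in (16.5) can be sketched in a few lines. The samples and coefficients below are hypothetical (they are not the output of any training procedure), and the Gaussian kernel is used only as an example.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)


def kernel_expansion(x, samples, alphas, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x), as in (16.5)."""
    return sum(a * kernel(x_i, x) for x_i, a in zip(samples, alphas))


# Hypothetical samples and coefficients; two coefficients are exactly zero.
samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (0.0, 3.0)]
alphas = [0.7, 0.0, -1.2, 0.0]

# Keep only the samples with nonzero coefficients (the support vectors).
support = [(x_i, a) for x_i, a in zip(samples, alphas) if a != 0.0]
sv_samples = [x_i for x_i, _ in support]
sv_alphas = [a for _, a in support]

x_new = (0.5, 0.5)
full = kernel_expansion(x_new, samples, alphas, gaussian_kernel)
sparse = kernel_expansion(x_new, sv_samples, sv_alphas, gaussian_kernel)
print(full, sparse)
```

The two evaluations coincide, so only the support vectors need to be stored once the coefficients are known.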

The representer theorem is generalized to differentiable loss functions [43] and 
even arbitrary monotonic ones [167]. A quantitative representer theorem has 
been proven in [188] without using the dual problem when convex loss functions 
are considered. In [48], an alternative formulation of the representer theorem is 
derived for convex loss functions. 

A classifier is called universally consistent if the risks of its decision functions 
converge to the Bayes risk in probability for all underlying distributions. Lower
(asymptotic) bounds on the number of support vectors are established in [188].

A relationship between SVM and a sparse approximation scheme that resem- 
bles the basis pursuit denoising algorithm is given in [68]. If the data are noiseless, 
the modified basis pursuit denoising method [68] is equivalent to SVM. The SVM 
technique can also be derived in the framework of regularization theory, establish- 
ing a connection between SVM, sparse approximation and regularization theory 
[68]. 


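Example 16.1: We revisit the XOR problem discussed in Example 10.1. Define
the kernel $k(\boldsymbol{x}, \boldsymbol{x}_i) = (1 + \boldsymbol{x}^T \boldsymbol{x}_i)^2$, where $\boldsymbol{x} = (x_1, x_2)^T$ and $\boldsymbol{x}_i = (x_{i1}, x_{i2})^T$. The
training samples $\boldsymbol{x}_1 = (-1, -1)$ and $\boldsymbol{x}_4 = (+1, +1)$ belong to class 0, and
$\boldsymbol{x}_2 = (-1, +1)$, $\boldsymbol{x}_3 = (+1, -1)$ to class 1.

Expanding the kernel function, we have

$$ k(\boldsymbol{x}, \boldsymbol{x}_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2} = \boldsymbol{\phi}(\boldsymbol{x}) \cdot \boldsymbol{\phi}(\boldsymbol{x}_i), $$

where

$$ \boldsymbol{\phi}(\boldsymbol{x}) = (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2)^T, $$
$$ \boldsymbol{\phi}(\boldsymbol{x}_i) = (1, x_{i1}^2, \sqrt{2} x_{i1} x_{i2}, x_{i2}^2, \sqrt{2} x_{i1}, \sqrt{2} x_{i2})^T. $$

The feature space defined by $\boldsymbol{\phi}(\boldsymbol{x})$ is six-dimensional. To discriminate the four
examples in the feature space, we define the decision boundary as $x_1 x_2 = 0$.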
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Figure 16.2 An illustration of the hyperplane in the two-dimensional feature space of SVM: The 
margin is defined by the distance between the hyperplane and the nearest of the examples in the two 
classes. Those examples in circles are support vectors, against which the margin pushes. 


When 2122 > 0, an example is categorized into class 0, and otherwise into class 
i; 
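
The expansion in Example 16.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `poly_kernel` and `phi` are illustrative choices and are not taken from the text. It compares the kernel value with the inner product of the explicit six-dimensional features and applies the rule based on the sign of $x_1 x_2$.

```python
import numpy as np

def poly_kernel(x, z):
    """Polynomial kernel k(x, z) = (1 + x^T z)^2 of Example 16.1."""
    return (1.0 + np.dot(x, z)) ** 2

def phi(x):
    """Explicit six-dimensional feature map associated with the kernel."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

# The four XOR samples x_1..x_4 and their classes (0 or 1), as in the example.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Kernel matrix computed directly and through the explicit feature map.
K = np.array([[poly_kernel(a, b) for b in X] for a in X])
K_phi = np.array([[phi(a) @ phi(b) for b in X] for a in X])

# Decision rule of the example: class 0 if x1 * x2 > 0, otherwise class 1.
predicted = np.where(X[:, 0] * X[:, 1] > 0, 0, 1)

print(np.allclose(K, K_phi), predicted)
```

The two kernel matrices agree, which is exactly the statement that the kernel evaluates an inner product in the feature space without forming the features.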


16.2 SVM model 


SVM was originally proposed for binary classification. Let $f: \mathbf{x} \to \{-1, 1\}$ 
be an unknown function and $D = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, N\} \subset R^n \times \{-1, 1\}$ 
be the training example set. SVM aims to find the function f with the optimal 
hyperplane that maximizes the margin between the examples of two different 
classes, as illustrated in Fig. 16.2. 

The optimal hyperplane for linear SVM can be constructed by solving the 
following primal QP problem [209]: 

$$\min E_0(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{p=1}^{N} \xi_p \tag{16.6}$$

subject to 

$$y_p (\mathbf{w}^T \mathbf{x}_p + \theta) \ge 1 - \xi_p, \quad p = 1, \ldots, N, \tag{16.7}$$
$$\xi_p \ge 0, \quad p = 1, \ldots, N, \tag{16.8}$$

where $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_N)^T$, $\xi_p$ are slack variables, $\mathbf{w}$ and $\theta$ are the 
weight and bias parameters for determining the hyperplane, $y_p \in \{-1, +1\}$ is 
the desired output of the classifier, and C is a regularization parameter that 
trades off wide margin with a small number of margin failures. Typically, C is 
optimized by employing statistical model selection procedures, e.g. crossvalidation. 

In the above SVM formulation, the slack variable $\xi_p$ in (16.6) corresponds 
to a loss function using $L_1$-norm. $L_1$-SVM solves the following unconstrained 
optimization problem (here $l$ is the number of training examples): 

$$\min J(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{j=1}^{l} \max(1 - y_j \mathbf{w}^T \mathbf{x}_j, 0). \tag{16.9}$$

$L_2$-SVM uses the sum of squared losses, and solves 

$$\min J(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{j=1}^{l} \left[\max(1 - y_j \mathbf{w}^T \mathbf{x}_j, 0)\right]^2. \tag{16.10}$$

$L_1$-SVMs are more popularly used than $L_2$-SVMs because they usually yield 
classifiers with far fewer support vectors, thus leading to better 
classification speed. SVM is related to regularized logistic regression, which solves 

$$\min J(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{j=1}^{l} \log\left(1 + e^{-y_j \mathbf{w}^T \mathbf{x}_j}\right). \tag{16.11}$$
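
The three objectives (16.9) to (16.11) differ only in the loss applied to the margin $y_j \mathbf{w}^T \mathbf{x}_j$. The sketch below, which assumes NumPy and uses illustrative function names and a made-up three-point data set, evaluates all three for a fixed weight vector so that the losses can be compared side by side. As in the equations, no bias term is included.

```python
import numpy as np

def l1_svm_objective(w, X, y, C):
    """Objective (16.9): hinge loss max(1 - y_j w^T x_j, 0)."""
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.maximum(1.0 - margins, 0.0))

def l2_svm_objective(w, X, y, C):
    """Objective (16.10): squared hinge loss."""
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.maximum(1.0 - margins, 0.0) ** 2)

def logistic_objective(w, X, y, C):
    """Objective (16.11): log(1 + exp(-y_j w^T x_j)), computed stably."""
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, -margins))

# Hypothetical toy data: margins are 2, 0.5 and 1 for w = (1, 0).
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])

obj_l1 = l1_svm_objective(w, X, y, C=1.0)
obj_l2 = l2_svm_objective(w, X, y, C=1.0)
obj_lr = logistic_objective(w, X, y, C=1.0)
print(obj_l1, obj_l2, obj_lr)
```

Only the second sample lies inside the margin, so it is the only one that contributes to the hinge and squared hinge terms, whereas the logistic loss assigns a small positive penalty to every sample.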
By applying the Lagrange multiplier method and replacing $\mathbf{x}_p^T \mathbf{x}$ by the kernel 
function $k(\mathbf{x}_p, \mathbf{x})$, the ultimate objective of SVM learning is to find $\alpha_p$, 
$p = 1, \ldots, N$, so as to minimize the dual quadratic form [209] 

$$E_{\rm SVM} = \frac{1}{2} \sum_{p=1}^{N} \sum_{i=1}^{N} y_p y_i k(\mathbf{x}_p, \mathbf{x}_i) \alpha_p \alpha_i - \sum_{p=1}^{N} \alpha_p \tag{16.12}$$

subject to 

$$\sum_{p=1}^{N} y_p \alpha_p = 0, \tag{16.13}$$
$$0 \le \alpha_p \le C, \quad p = 1, \ldots, N, \tag{16.14}$$

where $\alpha_p$ is the weight for the kernel corresponding to the pth example. The 
kernel function $k(\mathbf{x}_p, \mathbf{x}) = \boldsymbol{\phi}^T(\mathbf{x}_p) \boldsymbol{\phi}(\mathbf{x})$, where the form of $\boldsymbol{\phi}(\cdot)$ is implicitly 
defined by the choice of the kernel function and does not need to be given. When 
$k(\cdot)$ is a linear function, that is, $k(\mathbf{x}_p, \mathbf{x}) = \mathbf{x}_p^T \mathbf{x}$, SVM reduces to linear SVM 
[208]. The popular Gaussian and polynomial kernels are, respectively, given by 
(16.2) and (16.3). 

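The SVM output is given by 

$$y(\mathbf{x}) = {\rm sign}(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + \theta), \tag{16.15}$$

where $\theta$ is a threshold. After manipulation, the SVM output gives a classification 
decision 

$$y(\mathbf{x}) = {\rm sign}\left(\sum_{p=1}^{N} \alpha_p y_p k(\mathbf{x}_p, \mathbf{x}) + \theta\right). \tag{16.16}$$

The QP problem given by (16.12) through (16.14) will terminate when all of 
the KKT conditions are fulfilled: 

$$\begin{array}{ll} y_p u_p \ge 1, & \alpha_p = 0, \\ y_p u_p = 1, & 0 < \alpha_p < C, \\ y_p u_p \le 1, & \alpha_p = C, \end{array} \tag{16.17}$$

where $u_p$ is the SVM output for the pth example. Those patterns with nonzero 
$\alpha_p$ are the support vectors. By (16.17), those with $0 < \alpha_p < C$ lie exactly 
on the margin, while those with $\alpha_p = C$ lie on or inside it. 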
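
The KKT conditions (16.17) translate directly into a check that a solver can run on a candidate solution. The following is a minimal sketch, assuming NumPy; the function names and the tolerance `tol` are illustrative choices. It is exercised on the XOR problem with the kernel $(1 + \mathbf{x}^T \mathbf{z})^2$, where class 0 is mapped to the label $-1$ (an assumption made here to fit the $\{-1, +1\}$ convention) and where $\alpha_p = 1/8$, $\theta = 0$ satisfies (16.13), (16.14) and (16.17).

```python
import numpy as np

def svm_outputs(alpha, y, K, theta):
    """Return u_p = sum_i alpha_i y_i k(x_i, x_p) + theta for all training examples."""
    return K @ (alpha * y) + theta

def kkt_violations(alpha, y, K, theta, C, tol=1e-3):
    """Boolean mask of examples that violate (16.17) by more than tol."""
    r = y * svm_outputs(alpha, y, K, theta)   # y_p u_p
    at_zero = alpha <= tol                     # alpha_p = 0
    at_C = alpha >= C - tol                    # alpha_p = C
    free = ~at_zero & ~at_C                    # 0 < alpha_p < C
    return ((at_zero & (r < 1 - tol))
            | (free & (np.abs(r - 1) > tol))
            | (at_C & (r > 1 + tol)))

# XOR data; labels -1 for class 0 and +1 for class 1 (assumed mapping).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2
alpha_opt = np.full(4, 0.125)

viol_opt = kkt_violations(alpha_opt, y, K, theta=0.0, C=1.0)
viol_zero = kkt_violations(np.zeros(4), y, K, theta=0.0, C=1.0)
print(viol_opt, viol_zero)
```

With `alpha_opt` every example sits exactly on the margin, so no violation is reported; with all multipliers at zero every example has $y_p u_p = 0 < 1$ and is flagged.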

In two-class L2-SVM, all the slack variables $\xi_i$ receive the same penalty factor 
C. For imbalanced data sets, a common remedy is to use different C’s for the 
two classes. In the more general case, each training pattern can have its own 
penalty factor $C_i$. 

The generalization ability of SVM depends on the geometrical concept of span 
of support vectors [211]. The value of the span is proved to be always smaller 
than the diameter of the smallest sphere containing the support vectors, used in 
previous bounds [210]. The prediction of the test error given by the span is very 
accurate and has direct application in the choice of the optimal parameters of 
SVM. Bounds on the expectation of error from SVM are derived from the 
leave-one-out estimator, which is an unbiased estimate of the probability of 
test error [211]. 

An important approach for efficient SVM model selection is to use 
differentiable bounds of the leave-one-out error. For model selection, the 
radius margin bound for L2-SVM outperforms that for L1-SVM [51]. L1-SVM possesses 
the advantage of having fewer support vectors than L2-SVM. The selection of 
hyperparameters is investigated by using k-fold crossvalidation and leave-one- 
out criteria [51]. The gradient-descent algorithm is used for automatically tuning 
multiple parameters in an SVM [51]. 

The VC dimension of hyperplanes with margin $\rho$ is less than $D^2/(4\rho^2)$, where 
D is the diameter of the smallest sphere containing the training points [209]. 
Computing the VC dimension for homogeneous polynomial and Gaussian RBF 
kernels shows that SVM can have a very large (even infinite) VC dimension [17]. 
This means that an SVM has very strong classification/regression capacity. 

When both SVMs and feedforward networks use similar hidden-layer weights, 
accuracies are very similar [162]. Regarding the number of support vectors, 
sequential feedforward networks construct models with fewer hidden units than 
SVMs do and in the same range as sparse SVMs do. Computational time is 
lower for SVMs. The separating hyperplane for two-class classification obtained 
by SVM is shown to be equivalent to the solution obtained by LDA on the set 
of support vectors [174]. 

Similar to biological systems, an SVM ignores typical examples but pays 
attention to borderline cases and outliers. SVM is not obviously applicable to 
the brain. Bio-SVM [87] is a biologically feasible SVM. An unstable associative 
memory oscillates between support vectors and interacts with a feedforward 
classification pathway. Instant learning of surprising events and off-line 
tuning of support vector weights train the system. Emotion-based learning, 
forgetting trivia, sleep and brain oscillations are phenomena that agree with 
the Bio-SVM model, and a mapping to the olfactory system is suggested. 


16.3 Solving the quadratic programming problem 


$L_1$-norm soft-margin SVM, also called quadratic programming (QP) SVM, was 
introduced with polynomial kernels in [14] and with general kernels in [42]. Linear 
programming (LP) SVM is efficient and performs even better than QP-SVM for 
some purposes because of its linearity and flexibility for large data sets. An error 
analysis shows that the convergence behavior of LP SVM is almost the same 
as that of QP SVM [226]. By employing the $L_1$ or $L_\infty$ norm in maximizing 
margins, SVMs result in an LP problem that requires a lower computational 
load compared to SVMs with $L_2$ norm. 

The use of kernel mapping transforms SVM learning into a quadratic opti- 
mization problem, which has one global solution. There are many general opti- 
mization tools such as CPLEX, LOQO, MATLAB linprog and quadprog capable 
of solving linear and quadratic programs derived from SVMs. However, given N 
training patterns, a naive implementation of the QP solver takes $O(N^3)$ training 
time and at least $O(N^2)$ space. 

General convex QPs are typically solved by an interior-point method or an 
active-set method. If the Hessian Q of an objective function and/or the con- 
straint matrix of the QP problem is large and sparse, then an interior-point 
method is usually selected. If the problem is of moderate size but the matri- 
ces are dense, then an active-set method is preferable. In SVM problems the Q 
matrix is typically dense. Thus, large SVM problems present a challenge for both 
the approaches. For some classes of SVMs, Q is dense but low-rank; in such a 
case, one can adapt an interior-point method to work very efficiently [58], [57]. 
However, if the rank of Q is high, an active-set method seems to be suitable. 

The simplex method for LP problems is a traditional active-set method. It has 
very good practical performance. The idea is also used for solving QP problems. 
The active-set methods can benefit very well from warm starts, while the interior- 
point methods cannot. For instance, if some additional labeled training data 
become available, the old optimal solution is used as a starting point for the 
active-set algorithm and the new optimal solution is typically obtained within a 
few iterations. An active-set algorithm called SVM-QP [164] solves the convex 
QP problem by using the simplex method for convex quadratic problems. SVM- 
QP has an overall performance better than that of SVMlight, and has identical 
generalization properties. In addition, SVM-QP has better theoretical properties 
and it naturally, and almost without change, extends to incremental mode. This 
method fixes, at each iteration, all variables in the current dual active-set at 
their current values (0 or C), and then solves the reduced dual problem. 

The most common approach to large SVM problems is to use a restricted 
active-set method, such as chunking [14] or decomposition [150], [158], [90], where 
at each iteration only a small number of variables are allowed to be varied. 
These methods tend to have slow convergence when getting closer to the optimal 
solution. Moreover, their performance is sensitive to the changes in the chunk 
size and there is no good way of predicting a good choice for the size of the 
chunks. A full active-set method avoids these disadvantages. Active-set methods 
for SVM were used in [78] for generating the entire regularization path of the 
cost parameter for standard L2-SVM. 


Chunking 


The chunking technique [208], [210] breaks down a large QP problem into a series 
of smaller QP subproblems, whose ultimate goal is to identify all nonzero $\alpha_p$, 
since training examples with $\alpha_p = 0$ do not change the solution. The chunking 
algorithm starts with an arbitrary subset (chunk of data, working set) which 
can fit in the memory and solves the optimization problem on it by the general 
optimizer, and trains an initial SVM. Support vectors remain in the chunk while 
other points are discarded and replaced by a new working set with gross viola- 
tions of KKT conditions. Then, the SVM is retrained and the whole procedure 
is repeated. Chunking suffers from the problem that the entire set of support 
vectors that have been identified will still need to be trained at the end of the 
training process. 

Chunking is based on the sparsity of SVM’s solution, and support vectors 
actually take up a small fraction of the whole data set. There may be many 
more active candidate support vectors during the optimization process than the 
final ones so that their size can go beyond the chunking space. However, the 
resultant kernel matrix may still be too large to fit into memory. The method of 
selecting a new working set by evaluating KKT conditions without efficient kernel 
caching may lead to a high computational cost. The standard projected conjugate 
gradient chunking algorithm scales somewhere between $O(N)$ and $O(N^3)$ in the 
training set size N [40], [90]. 


Decomposition 


A large QP problem can be decomposed into smaller QP subproblems [150], 
[151], [158], [213]. Each subproblem is initialized with the results of the previous 
subproblem. However, decomposition requires a numerical QP algorithm such 
as the projected conjugate gradient algorithm. For the decomposition approach, 
the time complexity can be reduced from $O(N^3)$ to $O(N^2)$. 

A basic strategy commonly used in decomposition methods is to execute two 
operations repeatedly until some optimality condition is satisfied; one is to select 
q variables among $l$ and the other is to minimize the objective function by 
updating only the selected q variables. The set of q variables selected for 
updating at each step is called the working set. Only a fixed-size subset 
(working set) of the training data is optimized each time, while the variables 
corresponding to the other patterns are frozen [151]. Sequential minimal 
optimization (SMO) [158] selects only two variables for the working set in each 
iteration. SVMlight [90] sets the size of the working set to any even number q. 

SMO selects at each iteration a working set of size exactly two. Each small 
QP problem involves only two $\alpha_p$ and is solved analytically. The first variable is 
chosen among points that violate the KKT conditions, while the second variable 
is chosen so as to have a large increase in the dual objective. This two-variable 
joint optimization process is repeated until the loose KKT conditions are ful- 
filled for all training patterns. This avoids using QP optimization as an inner 
loop. The amount of memory required for SMO is $O(N)$. SMO has a computational 
complexity of somewhere between $O(N)$ and $O(N^2)$, and it is faster than 
projected conjugate gradient chunking. The computational complexity of SMO 
is dominated by SVM evaluation, thus SMO is very fast for linear SVMs and 
sparse data sets [158]. 
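
The two-variable analytic update at the heart of SMO can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure of [158]: it uses a precomputed kernel matrix, recomputes all errors in every step (which is wasteful but keeps the code short), and picks the second variable by the largest $|E_i - E_j|$ with a fallback to the remaining candidates. The function name `smo_train` and the tolerances are hypothetical.

```python
import numpy as np

def smo_train(K, y, C, tol=1e-3, max_passes=100):
    """Simplified SMO on the dual (16.12)-(16.14); returns (alpha, theta)."""
    n = len(y)
    alpha = np.zeros(n)
    theta = 0.0
    for _ in range(max_passes):
        changed = 0
        for i in range(n):
            E = K @ (alpha * y) + theta - y        # errors u_p - y_p
            r_i = y[i] * E[i]                       # equals y_i u_i - 1
            # First variable: must violate the KKT conditions (16.17).
            if not ((r_i < -tol and alpha[i] < C) or (r_i > tol and alpha[i] > 0)):
                continue
            # Second variable: try partners in order of decreasing |E_i - E_j|.
            for j in np.argsort(-np.abs(E[i] - E), kind="stable"):
                if j == i:
                    continue
                # Box [lo, hi] keeps 0 <= alpha <= C and sum_p y_p alpha_p = 0.
                if y[i] != y[j]:
                    lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
                if hi - lo < 1e-12 or eta <= 1e-12:
                    continue
                # Analytic optimum along the constraint line, then clipping.
                aj = float(np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, lo, hi))
                if abs(aj - alpha[j]) < 1e-8:
                    continue
                ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)
                # Threshold update so that the changed examples satisfy KKT.
                b1 = theta - E[i] - y[i] * (ai - alpha[i]) * K[i, i] - y[j] * (aj - alpha[j]) * K[i, j]
                b2 = theta - E[j] - y[i] * (ai - alpha[i]) * K[i, j] - y[j] * (aj - alpha[j]) * K[j, j]
                if 0.0 < ai < C:
                    theta = b1
                elif 0.0 < aj < C:
                    theta = b2
                else:
                    theta = 0.5 * (b1 + b2)
                alpha[i], alpha[j] = ai, aj
                changed += 1
                break
        if changed == 0:      # a full pass without updates: KKT hold within tol
            break
    return alpha, theta

# XOR with k(x, z) = (1 + x^T z)^2; labels -1/+1 for classes 0/1 (assumed mapping).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2
alpha, theta = smo_train(K, y, C=10.0)
print(alpha, theta, np.sign(K @ (alpha * y) + theta))
```

No QP library is called anywhere: each step solves a one-dimensional quadratic in closed form, which is the point made in the paragraph above.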

The performance of SMO is enhanced in [102] by replacing one-thresholded 
parameters with two-thresholded parameters, since the pair of patterns chosen 
for optimization is theoretically determined by two-thresholded parameters. The 
two-parameter SMO algorithm performs significantly faster than SMO. Three- 
parameter SMO [125], as a natural generalization to the two-parameter SMO 
algorithm, jointly optimizes three chosen parameters in a manner similar to that 
of the two-parameter SMO. It outperforms two-parameter SMO significantly in 
both the execution time and the computational complexity for classification as 
well as regression benchmarks. 

SMO algorithms have strict decrease of the objective function if and only if 
the working set is a violating pair [84]. However, the use of generic violating 
pairs as working sets is not sufficient to guarantee convergence properties of the 
sequence generated by a decomposition algorithm. As each iteration only involves 
two variables in the optimization, SMO has slow convergence. Nevertheless, as 
each iteration is computationally simple, an overall speedup is often observed in 
practice. Decomposition methods with $\alpha$-seeding are extremely useful for solving 
a sequence of linear SVMs with more data than attributes [96]. Analysis shows 
why $\alpha$-seeding is much more effective for linear than nonlinear SVMs [96]. 

Working-set selection is an important step in decomposition methods for train- 
ing SVMs. A popular way to select the working set is via the maximal violating 
pair. An SMO-type algorithm using maximal violating pairs as working sets is 
usually called the maximal-violating-pair algorithm. 
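
A maximal violating pair can be read off the gradient of the dual objective (16.12). The sketch below follows the first-order selection rule as it is commonly described for SMO-type solvers; the index sets `up` and `low`, the function name, and the stopping tolerance `eps` are notation assumed here rather than taken from the text.

```python
import numpy as np

def maximal_violating_pair(alpha, y, K, C, eps=1e-3):
    """Return the maximal violating pair (i, j), or None if alpha is eps-optimal."""
    Q = (y[:, None] * y[None, :]) * K
    grad = Q @ alpha - 1.0                 # gradient of the dual objective (16.12)
    score = -y * grad
    # Variables that may still move "up" or "down" without leaving the box.
    up = ((alpha < C) & (y > 0)) | ((alpha > 0) & (y < 0))
    low = ((alpha < C) & (y < 0)) | ((alpha > 0) & (y > 0))
    if not up.any() or not low.any():
        return None
    i = np.flatnonzero(up)[np.argmax(score[up])]
    j = np.flatnonzero(low)[np.argmin(score[low])]
    if score[i] - score[j] <= eps:
        return None                        # no violating pair remains
    return int(i), int(j)

# XOR with k(x, z) = (1 + x^T z)^2; labels -1/+1 for classes 0/1 (assumed mapping).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2

pair_start = maximal_violating_pair(np.zeros(4), y, K, C=10.0)
pair_opt = maximal_violating_pair(np.full(4, 0.125), y, K, C=10.0)
print(pair_start, pair_opt)
```

At the starting point the rule returns one positive and one negative example; at $\alpha_p = 1/8$ the gradient vanishes, so no pair is returned and the algorithm would stop.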

SMO is used in LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) 
[19]. In LIBSVM ver. 2.8, a new working set selection partially exploits the 
second-order information, thus increasing the convergence speed but also getting 
a moderate increase of the computational cost with respect to standard 
selections [55]. This achieves a theoretical property of linear convergence. A similar 
working-set selection strategy, called hybrid maximum-gain working-set selection 
[69], minimizes the number of kernel evaluations per iteration. This is achieved 
by the avoidance of cache misses in the decomposition algorithm. The hybrid 
maximum-gain algorithm has an efficient usage of the matrix cache. It reselects 
almost always one element of the previous working set. Therefore, at most one 
matrix row needs to be computed in every iteration. Hybrid maximum-gain work- 
ing set selection converges to an optimal solution. In contrast, for small problems 
LIBSVM ver. 2.8 is faster. 

In SVMlight [90], a good working set is selected by finding the steepest feasible 
direction of descent with q nonzero elements. The q variables that correspond to 
these elements compose the working set. When q is equal to 2, the selected work- 
ing set corresponds to the optimal pair in a modified SMO method [19]. SVMlight 
caches q rows of kernel matrix (row caching) to avoid kernel reevaluations and 
LRU (least recently used) is applied to update the rows in the cache. However, 
when the size of the training set is very large, the number of cached rows, which 
is dictated by the user, becomes small due to limited memory. The number of 
active variables is not large enough to achieve fast optimization. A generalized 
maximal-violating-pair policy for the working-set selection and a numerical solver 
for the inner QP subproblems are needed. For small working sets (q = O(10)), 
SVMlight often exhibits comparable performance with LIBSVM. 
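
Row caching with LRU replacement, as described above for SVMlight, can be illustrated with a small cache class. This is a sketch of the general idea only and is not SVMlight's actual implementation; the class name `KernelRowCache`, its attributes, and the linear kernel in the demo are hypothetical.

```python
from collections import OrderedDict
import numpy as np

class KernelRowCache:
    """Cache of kernel-matrix rows with least-recently-used (LRU) eviction."""

    def __init__(self, X, kernel, max_rows):
        self.X = X
        self.kernel = kernel
        self.max_rows = max_rows
        self.rows = OrderedDict()   # row index -> cached row, oldest first
        self.misses = 0

    def row(self, i):
        if i in self.rows:
            self.rows.move_to_end(i)          # mark row i as most recently used
            return self.rows[i]
        self.misses += 1                       # cache miss: evaluate N kernel values
        r = np.array([self.kernel(self.X[i], z) for z in self.X])
        self.rows[i] = r
        if len(self.rows) > self.max_rows:
            self.rows.popitem(last=False)      # evict the least recently used row
        return r

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
cache = KernelRowCache(X, kernel=lambda a, b: a @ b, max_rows=2)
for idx in [0, 1, 0, 2, 1]:
    cache.row(idx)
print(cache.misses, list(cache.rows))
```

In the demo the third access is a hit, while the last access misses because row 1 was evicted when row 2 was loaded. This is the effect noted above: when few rows fit in memory, working-set choices that reuse recent rows save kernel evaluations.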

Sigmoidal kernels may lead to kernel matrices that are not positive 
semidefinite, although positive semidefiniteness is required by the SVM 
framework to obtain the solution by means of QP techniques. SMO decomposition 
is used to solve nonconvex dual problems, leading to the software LIBSVM, 
which is able to provide a solution with sigmoidal 
kernels. An improved SVM with a sigmoidal kernel, called support vector per- 
ceptron [144], provides very accurate results in many classification problems, 
providing maximal margin solutions when classes are separable, and also pro- 
ducing very compact architectures comparable to MLPs. In contrast, LIBSVM 
with sigmoidal kernel has a much larger architecture. 

SimpleSVM [213] is a related scale-up method. At each iteration, a point vio- 
lating the KKT conditions is added to the working set by using rank-one update 
on the kernel matrix. However, storage is still a problem when SimpleSVM is 
applied to large dense kernel matrices. SimpleSVM divides the database into 
three groups: those groups for the nonbounded support vectors ($0 < \alpha_w < C$), 
for the bounded points - misclassified or in the margins ($\alpha_C = C$) and for the 
non-support vectors ($\alpha_0 = 0$). SimpleSVM solves an optimization problem such 
that the optimality condition leads to a linear system with $\alpha_w$ as unknown. 

Many kernel methods can be equivalently formulated as minimum enclosing 
ball problems in computational geometry [203]. By adopting an efficient approxi- 
mate minimum enclosing ball algorithm, one obtains provably approximate opti- 
mal solutions with the idea of core sets. The core set in core vector machine 
plays a similar role as the working set in other decomposition algorithms. Kernel 
methods (including the soft-margin one-class and two-class SVMs) are 
formulated as equivalent minimum enclosing ball problems, and then approximately 
optimal solutions are efficiently obtained by using core sets. Core vector machine 
[203] can be used with nonlinear kernels and has a time complexity $O(N)$ and a 
space complexity that is independent of N. Compared with existing SVM 
implementations, it is as accurate but is much faster and can handle much larger 
data sets. On relatively small data sets where $N < 2/\epsilon$, SMO can be faster. 
One could vary $\epsilon$ to adjust the tradeoff between efficiency and approximation 
quality, and $\epsilon = 10^{-6}$ is acceptable for most tasks. The minimum enclosing ball 
is equivalent to the hard-margin support vector data description (SVDD) [194]. 
The hard- (soft-) margin SVDD then yields identical solution as the hard- (soft-) 
margin one-class SVM, and the weight $\mathbf{w}$ in the one-class SVM solution is equal 
to the center $\mathbf{c}$ in the SVDD solution [168]. Finding the soft-margin one-class 
SVM is essentially the same as fitting the minimum enclosing ball with outliers. 
Core vector machine is similar to decomposition algorithms, but subset selection 
is much simpler. Moreover, while decomposition algorithms allow training 
patterns to join and leave the working set multiple times, patterns once recruited 
as core vectors by core vector machine will remain there during the whole 
training process. Core vector machine solves the QP on the core set only using SMO 
and thus obtains $\boldsymbol{\alpha}$. The stopping criterion is analogous to that for $\nu$-SVM [21]. 
Core vector machine critically requires the kernel function $k(\mathbf{x}, \mathbf{x})$ = constant for 
any $\mathbf{x}$. This condition is satisfied for the isotropic kernel (e.g. Gaussian kernel), 
the dot product kernel (e.g. polynomial kernel) with normalized inputs, and any 
normalized kernel. 
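
The requirement that $k(\mathbf{x}, \mathbf{x})$ be constant is easy to test for a given kernel. The following sketch, assuming NumPy, checks the three cases listed above on random data. The Gaussian kernel is written in the common form $\exp(-\|\mathbf{x} - \mathbf{z}\|^2 / (2\sigma^2))$, which is an assumption here since (16.2) is not reproduced in this section, and the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Isotropic Gaussian kernel (assumed form)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, z, d=2):
    """Dot-product (polynomial) kernel (1 + x^T z)^d."""
    return (1.0 + x @ z) ** d

def normalized(kernel):
    """Return the normalized kernel k(x, z) / sqrt(k(x, x) k(z, z))."""
    def k_tilde(x, z):
        return kernel(x, z) / np.sqrt(kernel(x, x) * kernel(z, z))
    return k_tilde

def self_similarities(kernel, X):
    """Values k(x, x) for every row x of X."""
    return np.array([kernel(x, x) for x in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)   # inputs scaled to unit norm

gauss_diag = self_similarities(gaussian_kernel, X)               # constant
poly_diag = self_similarities(poly_kernel, X)                    # depends on ||x||
poly_unit_diag = self_similarities(poly_kernel, X_unit)          # constant
poly_norm_diag = self_similarities(normalized(poly_kernel), X)   # constant
print(gauss_diag, poly_diag, poly_unit_diag, poly_norm_diag)
```

The raw polynomial kernel fails the condition because $k(\mathbf{x}, \mathbf{x}) = (1 + \|\mathbf{x}\|^2)^d$ varies with the input norm; either normalizing the inputs or normalizing the kernel restores it.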

Core vector machine does not converge towards the solution for all hyperpa- 
rameters [127]. It also requires that the kernel methods do not have a linear term 
in their dual objectives so as to make them minimum enclosing ball problems; 
it has been shown in [203] that this holds for the one-class SVM [168] and two- 
class SVM, but not for SVR. Generalized core vector machine introduces the 
center-constrained minimum enclosing ball problem [204] to make SVR also a 
minimum enclosing ball problem. It can be used with any linear/nonlinear kernel 
and can also be applied to kernel methods such as SVR, the ranking SVM, and 
two-class SVM for imbalanced data. It has the same asymptotic time complexity 
and space complexity as those of core vector machine. It has good performance, 
but is faster and produces few support vectors on very large data sets. 


Example 16.2: We use LIBSVM to classify 463 samples belonging to three classes 
in two-dimensional space. By selecting the RBF kernel $e^{-\gamma \|\mathbf{u} - \mathbf{v}\|^2}$ with 
$\gamma = 200$ and $C = 20$, all the samples are correctly classified. The total number 
of support vectors is 172. The classification result is shown in Fig. 16.3. 


Figure 16.3 SVM boundary for a three-class classification problem in two-dimensional space. 


Example 16.3: We revisit the Iris data set. By setting every fifth sample of the 
data set as a test sample and the remaining as training samples, we implement 
LIBSVM for classification. By selecting a polynomial kernel $(\gamma \mathbf{u}^T \mathbf{v} + c_0)^d$, with 
$\gamma = 10$, $d = 3$, $c_0 = 0$, $C = 2$, we get a testing accuracy of 96.6667% (29/30). 
The total number of support vectors is 13. 


16.3.3 Convergence of decomposition methods 


From the theoretical point of view, the policy for updating the working set plays 
a crucial role since it can guarantee a strict decrease of the objective function at 
each step [84]. The global convergence property of decomposition methods and 
SMO algorithms for classification has been clarified in [123], [100], [122], [28]. 
The convergence properties of SVMlight algorithm have been proved in [122], 
[123] under suitable convexity assumptions. 

In case of working sets of minimal size 2, a proper selection via the maximal- 
violating-pair principle is sufficient to ensure asymptotic convergence of the 
decomposition scheme [100], [123], [28]. For larger working sets, convergence 
proofs are available under a further condition which ensures that the distance 
between two successive approximations tends to zero [122]. The generalized SMO 
algorithm has been proved to terminate within a finite number of iterations under 
a prespecified stopping condition and tolerance [100]. A simple asymptotic 
convergence proof of the linear convergence of SMO-type decomposition methods 
under a general and flexible way of choosing the two-element working set is 
given in [28]. The generalized SMO algorithm and SVMlight have been proved 
in [189] to have the global convergence property. 

SVM is well understood when using conditionally positive-definite kernel 
functions. However, in practice, non-conditionally positive definite kernels arise in 
SVMs. Using these kernels causes loss of convexity. The LIBSVM software does 
converge for indefinite kernels such as the sigmoidal kernel. In [75], a step toward 
the comprehension of SVM classifiers in these situations is provided. A geomet- 
ric interpretation of SVMs with indefinite kernel functions is given. Such SVMs 
are shown to be optimal hyperplane classifiers not by margin maximization, but 
by minimization of distances between convex hulls in pseudo-Euclidean spaces. 
They are minimum distance classifiers with respect to certain points from the 
convex hulls of embedded training points. 


Least-squares SVMs 


Least-squares SVM (LS-SVM) [183], [185] is a variant of SVM which simplifies 
the training process of SVM to a great extent. It is introduced as a reformulation 
to SVM by replacing the inequality constraints with equality ones. It obtains an 
analytical solution directly from solving a set of linear equations instead of a QP 
problem. 

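The unknown parameters in the decision function $f(\mathbf{x})$, namely $\alpha_p$ and $\theta$, can 
be solved through the primal problem: 

$$\min \frac{1}{2} \mathbf{w}^T \mathbf{w} + \frac{C}{2} \sum_{p=1}^{N} \xi_p^2 \tag{16.18}$$

subject to 

$$y_p (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_p) + \theta) = 1 - \xi_p, \quad p = 1, \ldots, N, \tag{16.19}$$

where $\boldsymbol{\phi}(\cdot)$ is a linear or nonlinear function which maps the input space into a 
higher-dimensional feature space, $\mathbf{w}$ is a weight vector to be determined, C is a 
regularization constant and $\xi_p$ are slack variables. The Lagrangian is obtained 
as 

$$L(\mathbf{w}, \theta, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + \frac{C}{2} \sum_{p=1}^{N} \xi_p^2 - \sum_{p=1}^{N} \alpha_p \left[ y_p (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_p) + \theta) - 1 + \xi_p \right], \tag{16.20}$$

where the Lagrange multipliers $\alpha_p$ are used as the same $\alpha_p$ in $f(\mathbf{x})$. The necessary 
conditions for the optimality are given by the KKT conditions: 

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \Longrightarrow \mathbf{w} = \sum_{p=1}^{N} \alpha_p y_p \boldsymbol{\phi}(\mathbf{x}_p), \tag{16.21}$$
$$\frac{\partial L}{\partial \theta} = 0 \Longrightarrow \sum_{p=1}^{N} \alpha_p y_p = 0, \tag{16.22}$$
$$\frac{\partial L}{\partial \xi_p} = 0 \Longrightarrow \alpha_p = C \xi_p, \quad p = 1, \ldots, N, \tag{16.23}$$
$$\frac{\partial L}{\partial \alpha_p} = 0 \Longrightarrow y_p (\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_p) + \theta) - 1 + \xi_p = 0, \quad p = 1, \ldots, N. \tag{16.24}$$

Eliminating $\mathbf{w}$ and $\xi_p$ leads to 

$$\begin{bmatrix} 0 & -\mathbf{Y}^T \\ \mathbf{Y} & \boldsymbol{\Omega} + C^{-1} \mathbf{I} \end{bmatrix} \begin{bmatrix} \theta \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \tag{16.25}$$

where $\mathbf{Y} = (y_1, \ldots, y_N)^T$, $\Omega_{ij} = y_i y_j \boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_j)$, $\mathbf{1} = (1, \ldots, 1)^T$, and 
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_N)^T$. The function $\boldsymbol{\phi}(\mathbf{x})$ satisfies Mercer’s theorem: 
$\boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)$. 

The explicit solution to (16.25) is given by 

$$\theta = \frac{\mathbf{Y}^T \mathbf{A}^{-1} \mathbf{1}}{\mathbf{Y}^T \mathbf{A}^{-1} \mathbf{Y}}, \qquad \boldsymbol{\alpha} = \mathbf{A}^{-1} (\mathbf{1} - \theta \mathbf{Y}), \tag{16.26}$$

where $\mathbf{A} = \boldsymbol{\Omega} + C^{-1} \mathbf{I}$. The LS-SVM output for classification is given by (16.16). 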
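
Because LS-SVM training reduces to a linear system, the closed-form solution (16.26) takes only a few lines. The sketch below assumes NumPy and a precomputed kernel matrix; the function names are illustrative, and two linear solves are used in place of an explicit inverse of $\mathbf{A}$.

```python
import numpy as np

def lssvm_train(K, y, C):
    """Solve the LS-SVM system (16.25) through the explicit formulas (16.26)."""
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * K       # Omega_ij = y_i y_j k(x_i, x_j)
    A = Omega + np.eye(n) / C                    # A = Omega + C^{-1} I
    A_inv_1 = np.linalg.solve(A, np.ones(n))     # A^{-1} 1
    A_inv_y = np.linalg.solve(A, y)              # A^{-1} Y
    theta = (y @ A_inv_1) / (y @ A_inv_y)
    alpha = A_inv_1 - theta * A_inv_y            # A^{-1} (1 - theta Y)
    return alpha, theta

def lssvm_predict(K_test, alpha, y, theta):
    """Classification output (16.16); K_test[q, p] = k(x_p, x_q)."""
    return np.sign(K_test @ (alpha * y) + theta)

# XOR with k(x, z) = (1 + x^T z)^2; labels -1/+1 for classes 0/1 (assumed mapping).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2

alpha, theta = lssvm_train(K, y, C=10.0)
pred = lssvm_predict(K, alpha, y, theta)
print(alpha, theta, pred)
```

Every $\alpha_p$ is nonzero in the result, which illustrates the lack of sparseness of the LS-SVM solution: all training examples act as support vectors.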

Fixed-size LS-SVM [183] can rapidly find the sparse approximate solution of 
LS-SVM. It solves the LS problem in the primal space instead of in the dual 
space. The selection of hyperparameters plays an important role to the perfor- 
mance of LS-SVM. Among the kernel families investigated, the best or quasi-best 
test performance could be obtained by using the scaling RBF and RBF kernel 
functions [74]. 

LS-SVM obtains good performance on various classification problems, but has 
two limitations [67]. The computational complexity of LS-SVM usually scales 
as $O(N^3)$ for N samples. The solution of LS-SVM lacks sparseness, which causes 
very slow test speed. There are some fast algorithms for LS-SVM, such as a CG 
algorithm [184], an SMO algorithm [101], and a coordinate-descent algorithm 
[118]. These algorithms achieve low complexity, but their solutions are not sparse. 
In [128], the applicability of SMO is explored for solving the LS-SVM problem, 
by comparing first-order and second-order working-set selections concentrating 
on the RBF kernel. Second-order working-set selection is more convenient than 
first-order one. The number of kernel operations performed by SMO is O(N). 
Moreover, asymptotic convergence to the optimum is proved and the rate of 
convergence is shown to be linear for both selections. 

Fast sparse approximation for LS-SVM [89] overcomes the two limitations of 
LS-SVM. It is a fast greedy algorithm. It iteratively builds the decision function 
by adding one basis function from a kernel-based dictionary at a time based on 
a flexible and stable $\epsilon$-insensitive stopping criterion. A probabilistic version of 
the algorithm further improves its speed by employing a probabilistic speedup 
scheme. 

A simple approach to increase sparseness can be introduced by sorting the 
support value spectrum, i.e., the absolute value of the solution of LS-SVM. In 
[185], a sparse LS-SVM is constructed by deleting training examples associated 
with the smallest magnitude $\alpha(i)$ term that is proportional to the training error 
$e_i$. This algorithm is refined in [107]. Choosing the smallest $\alpha(i)$ does not 
necessarily result in the smallest change in training error when the parameters 
of the algorithm are updated. A pruning method based on minimizing the output 
error is used. The algorithm uses a pruning method with no regularization 
($\gamma = \infty$), leading to inversion of a singular matrix. A procedure of pruning with 
regularization ($\gamma$ finite and nonzero) is implemented in [108] to make the data 
matrix nonsingular; it uses a selective window algorithm that is computationally 
more efficient, as it adds and deletes training examples. 


Weighted least-squares SVM 

Weighted LS-SVM [187] improves LS-SVM by adding weights on error variables 
to correct the biased estimation of LS-SVM and to obtain robust estimation 
from noisy data. It first trains the samples using LS-SVM, then calculates the 
weights for each sample according to its error variable, and finally solves weighted 
LS-SVM. The traditional weight-setting algorithm for weighted LS-SVM depends on 
results from LS-SVM and requires retraining of weighted LS-SVM. 

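The weighted LS-SVM model can be described as [187] 

$$\min \frac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{2} \sum_{p=1}^{N} v_p \xi_p^2 \tag{16.27}$$

subject to 

$$y_p = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_p) + \theta + \xi_p, \quad p = 1, \ldots, N, \tag{16.28}$$

where $v_p$ is determined by 

$$v_p = \begin{cases} 1, & \text{if } |\xi_p / \hat{s}| \le c_1 \\ \dfrac{c_2 - |\xi_p / \hat{s}|}{c_2 - c_1}, & \text{if } c_1 < |\xi_p / \hat{s}| \le c_2 \\ 10^{-4}, & \text{otherwise} \end{cases} \tag{16.29}$$

$c_1$ and $c_2$ being thresholds. Assuming that $\xi_p$ has a Gaussian distribution, the 
statistic $\hat{s}$ is computed as $\hat{s} = {\rm IQR}/(2 \times 0.6745)$ or $\hat{s} = 1.483\,{\rm MAD}(x_i)$, 
where IQR stands for the interquartile range (the difference between the 75th 
percentile and 25th percentile), and MAD is the median absolute deviation. To 
calculate $v_p$, one can first train LS-SVM, and then compute $\hat{s}$ from the $\xi_p$ 
distribution. 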
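
The weight-setting step can be sketched as follows, given the error variables of an already trained LS-SVM. Several details are assumptions and should be checked against [187]: the linear taper $(c_2 - |\xi_p/\hat{s}|)/(c_2 - c_1)$ used for the middle branch, the default thresholds `c1=2.5` and `c2=3.0`, and the function names. The scale estimate uses the IQR formula quoted above.

```python
import numpy as np

def robust_scale(xi):
    """Robust scale estimate s = IQR / (2 * 0.6745) of the error variables."""
    q75, q25 = np.percentile(xi, [75, 25])
    return (q75 - q25) / (2 * 0.6745)

def lssvm_weights(xi, c1=2.5, c2=3.0):
    """Weights v_p from standardized errors; c1 and c2 are assumed defaults."""
    r = np.abs(xi / robust_scale(xi))
    v = np.full_like(r, 1e-4)                     # default: treated as an outlier
    mid = (r > c1) & (r <= c2)
    v[mid] = (c2 - r[mid]) / (c2 - c1)            # assumed linear taper between thresholds
    v[r <= c1] = 1.0                              # small errors keep full weight
    return v

# Hypothetical error variables, chosen so that the robust scale equals 1.
xi = np.array([-2.75, -0.6745, -0.6745, 0.0, 0.6745, 0.6745, 10.0])
v = lssvm_weights(xi)
print(robust_scale(xi), v)
```

The weights would then multiply the squared errors in (16.27) for a second training round, so that the sample with error 10 has almost no influence on the refitted model.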

Fast weighted LS-SVM [221] can be viewed as an iterative updating procedure 
from LS-SVM. A heuristic weight-setting method derives from the idea of outlier 
mining. Algorithms mining prior knowledge of data set can be effectively used in 
the weight-setting stage of weighted LS-SVM. It reaches final results of weighted 
LS-SVM within much less computational time. 

For two-dimensional samples such as images, MatLSSVM [214] is a classifier 
design method based on matrix patterns, such that the method can not only 
directly operate on original matrix patterns, but also efficiently reduce memory 
for the weight vector from $d_1 d_2$ to $d_1 + d_2$. MatLSSVM inherits LS-SVM’s 
existence of unclassifiable regions when extended to multiclass problems. A fuzzy 
version of MatLSSVM [214] removes unclassifiable regions effectively for 
multiclass problems. 

The iterative reweighted LS procedure for solving SVM [156] solves a sequence 
of weighted LS problems that lead to the true SVM solution. Iterative reweighted 
LS is also applicable for solving regression problems. The procedure is shown to 
converge to the SVM solution [157]. 


SVM training methods 


SVM algorithms with reduced kernel matrix 


To reduce the time and space complexities, a popular technique is to obtain low- 
rank approximations on the kernel matrix, by using the Nystrom method [224], 
greedy approximation [180], or matrix decomposition [58]. A method suggested in 
[180] approximates the data set in the feature space by a set in a low-dimensional 
subspace. A small subset of data points is randomly selected to form the basis 
of the approximating subspace. All other data points are then approximated by 
linear combinations of the elements of the basis. The basis is built iteratively, 
each new candidate element is chosen by a greedy method to reduce the bound 
on the approximation error as much as possible. The QP subproblem solver used 
is loqo, which is an interior-point-method solver provided with the SVMlight 
package. 
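
The low-rank idea can be illustrated with a minimal Nyström sketch: the full kernel matrix K is approximated by $C W^{-1} C^T$, where C holds the kernel columns of m selected points and W is the m x m kernel matrix among them. The Gaussian kernel, the helper names (`rbf`, `solve`, `nystrom`), the toy data and the hand-picked landmark indices are assumptions made for this illustration only; they are not taken from [224].

```python
import math

def rbf(x, z, gamma=0.5):
    """Gaussian kernel between two points (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def solve(A, B):
    """Solve A Z = B (B has several columns) by Gauss-Jordan elimination."""
    m = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(m)]  # augmented matrix [A | B]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(m):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[m:] for row in M]

def nystrom(X, landmarks, gamma=0.5):
    """Return the N x N approximation C W^{-1} C^T built from landmark columns."""
    N, m = len(X), len(landmarks)
    C = [[rbf(x, X[j], gamma) for j in landmarks] for x in X]             # N x m
    W = [[rbf(X[i], X[j], gamma) for j in landmarks] for i in landmarks]  # m x m
    Ct = [[C[i][k] for i in range(N)] for k in range(m)]                  # m x N
    Z = solve(W, Ct)                                                      # W^{-1} C^T
    return [[sum(C[i][k] * Z[k][j] for k in range(m)) for j in range(N)]
            for i in range(N)]

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
K_full = [[rbf(a, b) for b in X] for a in X]
K_approx = nystrom(X, landmarks=[0, 3, 5])
max_err = max(abs(K_full[i][j] - K_approx[i][j])
              for i in range(len(X)) for j in range(len(X)))
print("max abs error with 3 of 6 landmarks:", max_err)
```

Only the N x m block C and the m x m block W need to be stored, which is where the saving in time and memory comes from. The rows of K that belong to the landmarks are reproduced exactly; the remaining entries are approximated.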

The reduced SVM formulation for binary classification is derived from gener- 
alized SVM [132] and smooth SVM [113]. Prior to training, reduced SVM [112] 
randomly selects a portion of the data set so as to generate a thin rectangular 
kernel matrix, which is then used to replace the full kernel matrix in the non- 
linear SVM formulation. Reduced SVM uses a nonstandard SVM cost function. 
No constraints are needed and a quadratically converging Newton algorithm can 
be used for training. The time complexity of the optimization routine is O(N). 
Though it has higher training errors than SVM, reduced SVM has comparable, or 
sometimes slightly better, generalization ability [114]. On some small data sets, 
reduced SVM performs even better than SVM. The technique of using a reduced 
kernel matrix has been applied to other kernel-based learning algorithms, such 
as proximal SVM [64], ε-smooth SVR [114], Lagrangian SVM [133], active set
SVR [140], and LS-SVM [183]. 

Lagrangian SVM [133] is a fast and extremely simple iterative algorithm, capa- 
ble of classifying data sets with millions of points. The full algorithm is given 
in 11 lines of MATLAB code without any special optimization tools such as LP 
or QP solvers. For nonlinear kernel classification, Lagrangian SVM can handle 
any positive semidefinite kernel and is guaranteed to converge. For a positive 
semidefinite nonlinear kernel, an inversion of a single N x N Hessian matrix of 
the dual is required. Hence, with a nonlinear kernel, Lagrangian SVM can handle only
intermediate-size problems.
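
The following is a minimal sketch of a Lagrangian-SVM-style iteration for the linear case, assuming the dual $\min_{u \ge 0} \frac{1}{2}u^T Q u - e^T u$ with $Q = I/\nu + HH^T$ and $H = D[A \; -e]$, and the fixed-point update $u \leftarrow Q^{-1}(e + ((Qu - e) - \alpha u)_+)$ with $0 < \alpha < 2/\nu$. The symbols, the dense solver `solve`, the toy data and the default parameters are assumptions of this sketch. It solves an N x N system in every step, so it does not reproduce the efficiency of the MATLAB code mentioned above.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lagrangian_svm(X, y, nu=1.0, max_iter=2000, tol=1e-10):
    m, n = len(X), len(X[0])
    alpha = 1.9 / nu  # step parameter, must satisfy 0 < alpha < 2/nu
    H = [[yi * v for v in xi] + [-yi] for xi, yi in zip(X, y)]  # H = D [A  -e]
    Q = [[sum(a * b for a, b in zip(H[i], H[j])) + (1.0 / nu if i == j else 0.0)
          for j in range(m)] for i in range(m)]                 # Q = I/nu + H H^T
    u = solve(Q, [1.0] * m)                                     # start at Q^{-1} e
    for _ in range(max_iter):
        Qu = [sum(Q[i][j] * u[j] for j in range(m)) for i in range(m)]
        rhs = [1.0 + max(0.0, (Qu[i] - 1.0) - alpha * u[i]) for i in range(m)]
        u_new = solve(Q, rhs)
        change = max(abs(a - b) for a, b in zip(u_new, u))
        u = u_new
        if change < tol:
            break
    w = [sum(X[i][k] * y[i] * u[i] for i in range(m)) for k in range(n)]  # w = A^T D u
    gamma = -sum(y[i] * u[i] for i in range(m))                           # gamma = -e^T D u
    return w, gamma

X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, gamma = lagrangian_svm(X, y)
pred = [1 if sum(wk * xk for wk, xk in zip(w, x)) - gamma > 0 else -1 for x in X]
print("w =", w, "gamma =", gamma, "predictions =", pred)
```

No LP or QP solver is involved; each step is a projection followed by a linear solve with the same matrix Q, which is why a single factorization (or inversion) of Q is the dominant cost.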

Like LS-SVM, proximal SVM [64] replaces the inequality by equality in the 
defining constraint structure of the SVM framework and uses the LS concept. It 
replaces the absolute error measure by the squared error measure in defining the
minimization problem. 
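
With equality constraints and a squared error, the linear problem has a closed-form solution. The sketch below assumes the formulation $\min \frac{\nu}{2}\|\xi\|^2 + \frac{1}{2}(w^T w + \gamma^2)$ subject to $D(Aw - e\gamma) + \xi = e$, which gives $[w; \gamma] = (I/\nu + E^T E)^{-1} E^T D e$ with $E = [A \; -e]$. The notation, the helper `solve` and the toy data are assumptions of this illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def proximal_svm(X, y, nu=1.0):
    """Return (w, gamma) of the linear proximal classifier sign(w.x - gamma)."""
    E = [list(xi) + [-1.0] for xi in X]          # E = [A  -e]
    d = len(E[0])
    G = [[sum(r[a] * r[b] for r in E) + (1.0 / nu if a == b else 0.0)
          for b in range(d)] for a in range(d)]  # I/nu + E^T E
    rhs = [sum(r[a] * yi for r, yi in zip(E, y)) for a in range(d)]  # E^T D e
    z = solve(G, rhs)
    return z[:-1], z[-1]

X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, gamma = proximal_svm(X, y)
pred = [1 if sum(wk * xk for wk, xk in zip(w, x)) - gamma > 0 else -1 for x in X]
print("w =", w, "gamma =", gamma, "predictions =", pred)
```

Training reduces to one linear system whose size is the input dimension plus one, independent of the number of samples.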

Generalized eigenvalue proximal SVM (GEPSVM) [135] relaxes the parallelism 
condition on proximal SVM. It classifies points by assigning them to the closest 
of two nonparallel planes which are generated by their generalized eigenvalue 
problems. A simple geometric interpretation of GEPSVM is that each plane is 
closest to the points of its own class and farthest from the points of the other class.

Twin SVM [88] determines two nonparallel proximal hyperplanes by solving 
two smaller related SVM-type problems. It aims at generating two nonparallel 
hyperplanes such that each plane is closer to one of the two classes and is as 
far as possible from the other. Because each of the two problems is smaller than the
single SVM problem, twin SVM is almost four times faster than SVM. The twin SVM
formulation is in the spirit of proximal SVMs via
generalized eigenvalues. Twin SVM is not only fast, but compares favorably 
with SVM and GEPSVM in terms of generalization. When twin SVMs are used 
with a nonlinear kernel, a classifier may be obtained very rapidly for unbalanced 
data sets. To increase the efficiency of twin SVM, a coordinate-descent
margin-based twin SVM [173] has been proposed, which leads to very fast training. It handles one data
point at a time, and can process very large data sets that need not reside in 
memory. 

A direct method [227] that is similar to reduced SVM is to build sparse ker- 
nel learning algorithms by adding one more constraint to the convex optimiza- 
tion problem, such that the sparseness of the resulting kernel machine is explic- 
itly controlled while performance is kept as high as possible. A gradient-based 
approach solves this modified optimization problem by adding an explicit spar- 
sity constraint to the multiple-kernel learning of SVM [81]. The desired kernel is 
simply a convex combination of the given base kernels. The resultant classifier 
is both compact and accurate, and ensures good generalization. 


ν-SVM


ν-SVM [166] is a class of support vector algorithms for regression and
classification. Parameter $\nu$ controls the number of support vectors, and it enables
one to eliminate one of the other free parameters of the algorithm. ν-SVM and SVM
are two different problems with the same optimal solution set [20]. Compared to
SVM, the formulation of ν-SVM is more complicated. A decomposition method
for ν-SVM [20] is competitive with existing methods for SVM. The decomposition
algorithm for ν-SVR is similar to that for ν-SVM. ν-SVR is a modification
of the ε-SVR algorithm, and it automatically minimizes $\epsilon$. The implementation
is part of LIBSVM.
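
For orientation, the ν-SVM classification problem is commonly written in the following primal form. The symbols $b$ (bias), $\phi$ (feature map), $\xi_i$ (slack variables) and $\rho$ (margin variable) are assumed here and may differ from the notation used elsewhere in this chapter.

```latex
\min_{w,\, b,\, \xi,\, \rho} \;\; \frac{1}{2}\|w\|^2 \;-\; \nu\rho \;+\; \frac{1}{N}\sum_{i=1}^{N}\xi_i
\qquad \text{s.t.} \quad
y_i\left(w^T\phi(x_i) + b\right) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0 .
```

In this form, when the optimal $\rho$ is positive, $\nu$ is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, which is the sense in which $\nu$ controls the number of support vectors.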

In [86], the geometrical meaning of SVMs with $L_p$ norm is investigated. The
ν-SVM(p) solution $w_p$ is closely related to the ν-SVM solution $w_2$ and has little
dependency on $p$, and the generalization error barely depends on $p$. These results
are applicable to SVR, since it has a similar geometrical structure.

Par-ν-SVM [77] is a modification of ν-SVM for regression and classification,
and the use of a parametric-insensitive/margin model with an arbitrary shape
is demonstrated. As in ν-SVM, $\nu$ is used to control the number of errors and
support vectors. By devising a parametric-insensitive loss function, par-ν-SVR
automatically adjusts a flexible parametric-insensitive zone of arbitrary shape 
and minimal radius to include the given data. 

A common approach to classifier design is to optimize the expected misclassi- 
fication (Bayes) cost. Often, this approach is impractical because either the prior 
class probabilities or the relative cost of false alarms and misses are unknown. 
Two alternatives to the Bayes cost for the training of SVM classifiers are the 
minimax and Neyman-Pearson criteria, which require no knowledge of prior class 
probabilities or misclassification costs [45]. Cost-sensitive extensions of SVM and 
ν-SVM are 2C-SVM [152] and 2ν-SVM [33]. 2C-SVM is proved equivalent to
2ν-SVM [45].


Cutting-plane technique 


For large-scale L1-SVM, SVMperf [92] uses a cutting-plane technique to obtain 
the solution of (16.9). The cutting-plane algorithm [198] is a general approach 
for solving problem (16.6). It is based on iterative approximation of the risk 
term by cutting planes. It solves a reduced problem obtained by substituting the 
cutting-plane approximation of the risk into the original problem (16.6). The 
cutting-plane model makes it straightforward to add basis vectors that are not 
in the training set. A closely related method [227] explores training SVMs with 
kernels that can represent the learned rule using arbitrary basis vectors, not just 
the support vectors from the training set [94]. A bundle method is applied in [182],
and SVMperf is viewed as its special case.
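
The approximation of the risk term by cutting planes can be sketched as follows for the averaged hinge risk of a linear classifier without bias. Each plane is a linearization of the risk at some point $w_t$; the maximum over the collected planes is a piecewise-linear lower bound of the risk. The function names, the toy data and the anchor points are assumptions of this sketch, and the reduced optimization problem that a full cutting-plane solver would solve after adding each plane is deliberately omitted.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def hinge_risk(w, X, y):
    """Average hinge loss R(w) = (1/N) sum max(0, 1 - y_i w.x_i)."""
    return sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y)) / len(X)

def cutting_plane(w_t, X, y):
    """Return (a, b) such that R(w) >= a.w + b for all w, with equality at w_t."""
    n = len(X)
    a = [0.0] * len(w_t)
    b = 0.0
    for xi, yi in zip(X, y):
        if yi * dot(w_t, xi) < 1.0:  # hinge term is active at w_t
            a = [aj - yi * xj / n for aj, xj in zip(a, xi)]  # subgradient part
            b += 1.0 / n                                      # offset part
    return a, b

def model(w, planes):
    """Cutting-plane model of the risk: maximum over all collected planes."""
    return max(dot(a, w) + b for a, b in planes)

X = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0], [0.5, -0.5]]
y = [1, 1, -1, -1, 1]
anchors = [[0.0, 0.0], [1.0, 0.0], [0.3, 0.3]]
planes = [cutting_plane(w_t, X, y) for w_t in anchors]

for w in ([0.5, 0.5], [1.0, -1.0], [0.3, 0.3]):
    print(w, "model:", model(w, planes), "risk:", hinge_risk(w, X, y))
```

The model never exceeds the true risk and touches it at every anchor point, so adding planes where the current solution is poorly approximated tightens the reduced problem step by step.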

It can be shown that cutting-plane methods converge to an $\epsilon$-accurate solution
of the regularized risk minimization problem in $O(1/(\epsilon\lambda))$ iterations, where $\lambda$ is
the trade-off parameter between the regularizer and the loss function [199].

An optimized cutting-plane algorithm [61] solves large-scale risk minimization
problems by extending the standard cutting-plane algorithm [198]. An efficient
line-search procedure for the optimization of (16.6) is the only additional
requirement of the optimized cutting-plane algorithm compared to the standard
cutting-plane algorithm. The number of iterations the optimized cutting-plane
algorithm requires to converge to an $\epsilon$-precise solution is approximately O(N). An
optimized cutting-plane algorithm-based linear binary SVM solver outperforms 
SVMlight, SVMperf and the cutting-plane algorithm, achieving a speedup fac- 
tor of more than 1,200 over SVMlight on some data sets and a speedup factor 
of 29 over SVMperf, while obtaining the same precise support vector solution. 
A cutting-plane algorithm-based linear binary SVM solver often shows faster 
convergence than gradient descent and Pegasos, and its linear multiclass version 
achieves a speedup factor of up to 10 compared to multiclass SVM. 

For an equivalent 1-slack reformulation of the linear SVM training problem, the
cutting-plane method has time complexity O(N) [93]. In particular, the number
of iterations does not depend on N, and it is linear in the desired precision and
the regularization parameter. The cutting-plane algorithm includes the training 
algorithm of SVMperf [92] for linear two-class SVMs as a special case. In [93], 
not only individual data points are considered as potential support vectors, but 
also linear combinations of those. This increased flexibility allows for solutions 
with far fewer nonzero dual variables, leading to small cutting-plane models.

The cutting-plane subspace pursuit method [94], like basis pursuit methods, 
iteratively constructs the basis set. The method is efficient and modular. Its 
classification rules can be orders of magnitude sparser than the conventional 
support-vector representation while providing comparable prediction accuracy. 
The algorithm produces sparse solutions that are superior to approximate
solutions of the Nyström method [224], incomplete Cholesky factorization [58], core
vector machine, ball vector machine [205], and LASVM with margin-based active
selection and finishing [11]. Both the Nyström method and incomplete Cholesky
factorization are implemented in SVMperf.


Gradient-based methods 


Successive overrelaxation [131] is a derivative of the coordinate-descent method. 
It updates only one variable at each iteration. The method is used for solving 
symmetric linear complementarity problems and quadratic programs to train an
SVM with very large data sets. On smaller problems the successive overrelaxation 
method is faster than SVMlight and SMO. 

The oLBFGS algorithm [171] compares the derivatives $g_{t-1}(w_{t-1})$ and
$g_{t-1}(w_t)$ for an example $(x_{t-1}, y_{t-1})$. Compared to the first-order stochastic
gradient descent, each iteration of oLBFGS computes the additional quantity
$g_{t-1}(w_t)$ and updates the list of k rank-one updates. Setting the global learning
gain is very difficult [13].

The stochastic gradient-descent quasi-Newton (SGD-QN) algorithm [12] 
together with corrected SGD-QN [13] is a stochastic gradient-descent algorithm 
for linear SVMs that makes use of second-order information and splits the param- 
eter update into independently scheduled components. It estimates a diagonal 
rescaling matrix using a technique inspired by oLBFGS. SGD-QN iterates nearly 
as fast as a first-order gradient descent, but requires fewer iterations to achieve the
same accuracy. Stochastic algorithms often yield the best generalization performance
in spite of being the worst optimizers. Corrected SGD-QN [13] discovers sensible
diagonal scaling coefficients. Similar speed improvements can be achieved
by simple preconditioning techniques such as normalizing the means and the 
variances of each feature and normalizing the length of each example. Corrected 
SGD-QN can adapt automatically to skewed feature distributions or very sparse 
data. 

To solve linear SVM in large-scale scenarios, modified Newton methods for
training L2-SVM are given in [99], [134]. For L2-SVM, a single-variable piecewise 
quadratic function (16.10) is minimized, which is differentiable but not twice 
differentiable. To obtain the Newton direction, they use the generalized Hessian 
matrix. The trust-region Newton method (TRON) [124] is a fast implementation 
for L2-SVM. A trust-region Newton method for logistic regression (TRON-LR)
is proposed in [124]. 

For L2-SVM, a coordinate-descent method [235] updates one component of w 
at a time while fixing the other variables by solving a one-variable subproblem 
by applying a modified Newton method with the line-search technique similar 
to the trust-region method [23]. With a necessary condition of convexity, the
coordinate-descent method maintains a strict decrease of the function value. In
[23], the full Newton step is used if possible, leading to faster convergence; the
method is more efficient and stable than Pegasos and TRON, and it is proved to
converge globally to the unique minimum at a linear rate.
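
A minimal sketch of such a coordinate-wise scheme for the L2-loss linear SVM is given below, assuming the objective $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i w^T x_i)^2$ without a bias term. Each one-variable subproblem takes a Newton step based on the generalized second derivative and halves the step until the objective does not increase. This plain backtracking rule is a simplification chosen for the illustration; it is not the sufficient-decrease condition of [23], and the function names and data are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def objective(w, X, y, C):
    """L2-SVM primal objective: 0.5*||w||^2 + C * sum max(0, 1 - y_i w.x_i)^2."""
    loss = sum(max(0.0, 1.0 - yi * dot(w, xi)) ** 2 for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * loss

def coordinate_descent_l2svm(X, y, C=1.0, epochs=20):
    w = [0.0] * len(X[0])
    history = [objective(w, X, y, C)]
    for _ in range(epochs):
        for j in range(len(w)):
            g, h = w[j], 1.0  # first and generalized second derivative in w_j
            for xi, yi in zip(X, y):
                b = 1.0 - yi * dot(w, xi)
                if b > 0.0:  # only samples with positive loss contribute
                    g -= 2.0 * C * yi * xi[j] * b
                    h += 2.0 * C * xi[j] ** 2
            d = -g / h  # Newton direction for the one-variable subproblem
            f_old = objective(w, X, y, C)
            step = 1.0
            while step > 1e-12:  # try the full step first, then backtrack
                trial = list(w)
                trial[j] += step * d
                if objective(trial, X, y, C) <= f_old:
                    w = trial
                    break
                step *= 0.5
        history.append(objective(w, X, y, C))
    return w, history

X = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
w, history = coordinate_descent_l2svm(X, y)
print("w =", w)
print("objective per epoch:", history[:4], "...", history[-1])
```

Because a trial point is accepted only when the objective does not increase, the recorded objective values are non-increasing from epoch to epoch.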


Training SVM in the primal formulation 


Literature on SVM mainly concentrates on the dual optimization problem. Dual- 
ity theory provides a convenient way to deal with the constraints. The dual opti- 
mization problem can be written in terms of dot products, thereby making it 
possible to use kernel functions. The primal QP problem can be prohibitively 
large while its Wolfe dual QP problem is considerably smaller. For solving the 
problem in the primal, the optimization problem is mainly written as an uncon- 
strained one and the representer theorem is used. It is common to employ a 
two-stage training process where the first stage produces an approximate solu- 
tion to the dual QP problem and the second stage maps this approximate dual 
solution to an approximate primal solution. In terms of both the solution and
the time complexity, when it comes to approximate solutions, primal optimization
is superior because it is directly focused on minimizing the primal objective
function [26]. Also, the corresponding implementation is very simple and does 
not require any optimization libraries. 

A wide range of machine learning methods can be described as the unconstrained
regularized risk minimization problem (16.6), where $w \in \mathbb{R}^n$ denotes
the parameter vector to be learned, $\frac{1}{2}\|w\|^2$ is a quadratic regularization term,
C > 0 is a fixed regularization constant and the second term is a nonnegative 
convex risk function approximating the empirical risk. Using the primal formula- 
tion (16.6) is efficient when N is very large and the dimension of the input data 
is moderate or the inputs are sparse. 

Primal optimization of linear SVMs has been studied in [99], [134]. The finite 
Newton method [134] is a direct primal algorithm for L2-SVM that exploits the 
sparsity property. It is rather effective for linear SVM. In [99], the finite Newton 
method is modified by using CG techniques to implement the Newton iterations
to obtain a very fast method for solving linear SVMs with L2 loss function.
The method is much faster than decomposition methods such as SVMlight, SMO
and BSVM [79] (e.g., 4-100 fold), especially when the number of examples is 
large. For linear SVMs, the primal optimization is definitely superior to the dual 
optimization [99]. 

Primal optimization of nonlinear SVMs has been implemented in smooth SVM 
[113]. On larger problems, smooth SVM is comparable to or faster than SVMlight,
successive overrelaxation [131] and SMO. The recursive finite Newton method 
[26] can efficiently solve the primal problem for both linear and nonlinear SVMs. 
Performing Newton optimization in the primal yields exactly the same computational
complexity as optimizing the dual. In [85], algorithms are described that accept an
accuracy $\epsilon_p$ of the primal QP problem as an input and are
guaranteed to produce an approximate solution that satisfies this accuracy in
low-order polynomial time.

Following the manifold regularization approach, Laplacian SVMs have shown 
excellent performance in semi-supervised classification. Two strategies presented 
in [138] solve the primal Laplacian SVM problem, in order to overcome some 
issues of the original dual formulation. In particular, training a Laplacian SVM in 
the primal can be efficiently performed with preconditioned CG. Training is sped 
up by using an early stopping strategy based on the prediction on unlabeled data 
or, if available, on labeled validation examples. The computational complexity 
of the training algorithm is reduced from $O(N^3)$ to $O(kN^2)$, where N is the
combined number of labeled and unlabeled examples and $k \ll N$.

Another approach to the primal optimization is based on decomposing the ker- 
nel matrix and thus effectively linearizing the problem. Among the most efficient 
solvers are Pegasos [172] and stochastic gradient descent
(http://leon.bottou.org/projects/sgd) [15], both of which are based on stochastic (sub-)gradient
descent. Pegasos is an efficient primal estimated sub-gradient solver for L1-SVM 
which alternates between stochastic gradient-descent steps and projection steps, 
and it outperforms SVMperf. 
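
The alternation between a stochastic sub-gradient step and a projection step can be sketched as follows for a linear L1-SVM without bias, assuming the objective $\frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_i \max(0, 1 - y_i w^T x_i)$, the step size $1/(\lambda t)$ and a projection onto the ball of radius $1/\sqrt{\lambda}$. The function name, the single-sample variant, the fixed random seed and the toy data are assumptions of this illustration.

```python
import math
import random

def pegasos(X, y, lam=0.1, T=2000, seed=0):
    """Single-sample Pegasos-style solver for a linear SVM without bias."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    radius = 1.0 / math.sqrt(lam)
    for t in range(1, T + 1):
        i = rng.randrange(len(X))           # pick one training example at random
        eta = 1.0 / (lam * t)               # decreasing step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        w = [(1.0 - eta * lam) * wj for wj in w]    # step on the regularizer
        if margin < 1.0:                            # hinge loss is active
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > radius:                           # projection step
            w = [wj * radius / norm for wj in w]
    return w

X = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
w = pegasos(X, y)
pred = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
print("w =", w, "predictions =", pred)
```

Each iteration touches a single example, so the cost per step is independent of the number of training examples.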


Clustering-based SVM 


Clustering-based SVM [231] maximizes the SVM performance for very large data 
sets given a limited amount of resources. It applies a hierarchical micro-clustering
algorithm BIRCH to get finer descriptions at places close to the classification 
boundary and coarser descriptions at places far from the boundary. The training 
complexity is $O(n^2)$ when there are n support vectors. Clustering-based SVM can
be used to classify very large data sets of relatively low dimensions. It performs 
especially well where random sampling is not effective. This occurs when the 
important data occur infrequently or when the incoming data includes irregular 
patterns, resulting in different distributions between training and testing data.

Bit-reduction SVM [105] is a simple strategy to speed up the training and prediction
procedures for an SVM. It groups similar examples together by reducing
their resolution. Bit reduction reduces the resolution of the input data and groups 
similar data into one bin. A weight is assigned to each bin according to the number
of examples from a particular class in it, and a weighted example is created.
This data reduction and aggregation step is very fast and scales linearly with 
respect to the number of examples. Then, an SVM is built on a set of weighted 
examples which are the exemplars of their respective bins. Optimal compression 
parameters need only to be computed once and can be reused if data arrive incre- 
mentally. It is typically more accurate than random sampling when the data are 
not overcompressed. 
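
The data reduction step can be sketched as follows, assuming non-negative integer features whose low-order bits are dropped, one bin per (cell, class) pair, and the mean of the bin members as the exemplar. These choices, the function name `bit_reduce` and the toy data are assumptions made for the illustration; [105] may define the bins and exemplars differently. Training a weighted SVM on the exemplars is not shown.

```python
from collections import defaultdict

def bit_reduce(X, y, bits):
    """Group examples whose features agree after dropping `bits` low-order bits.

    Returns a list of (exemplar, label, weight) with weight = number of members.
    """
    bins = defaultdict(list)
    for xi, yi in zip(X, y):
        cell = tuple(v >> bits for v in xi)  # reduced-resolution coordinates
        bins[(cell, yi)].append(xi)          # separate bins per class label
    exemplars = []
    for (cell, label), members in sorted(bins.items()):
        mean = [sum(col) / len(members) for col in zip(*members)]
        exemplars.append((mean, label, len(members)))
    return exemplars

X = [[10, 12], [11, 13], [9, 14], [40, 41], [42, 43], [41, 40], [100, 7]]
y = [1, 1, 1, -1, -1, 1, -1]
exemplars = bit_reduce(X, y, bits=2)
for mean, label, weight in exemplars:
    print("exemplar", mean, "label", label, "weight", weight)
```

A single pass over the data with a hash lookup per example is enough, which matches the linear scaling stated above. With `bits=2` the seven examples collapse to four weighted exemplars.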

Multi-prototype SVM [2] extends multiclass SVM to multiple prototypes per 
class. It allows several vectors to be combined in a principled way to obtain large
margin decision functions. This extension defines a non-convex problem. The 
algorithm reduces the overall problem into a series of simpler convex problems. 
The approach compares favorably with LVQ.

In subspace-based SVMs [103], an input vector is classified into the class with 
the maximum similarity. For each class we define the weighted similarity measure 
using the vectors called dictionaries that represent the class, and optimize the 
weights so that the margin between classes is maximized. Introducing slack vari- 
ables, these constraints are defined either by equality or inequality constraints. 
Subspace-based LS-SVMs and subspace-based LP SVMs are obtained [103]. 

The FCNN-SVM classifier [5] combines the SVM approach and the fast 
nearest-neighbor condensation classification rule (FCNN) in order to make SVM 
practical on large collections of data. On very large and multidimensional data 
sets, FCNN-SVM training is one or two orders of magnitude faster than SVM, 
and the number of support vectors is more than halved with respect to SVM, at 
the expense of a little loss of accuracy. 

Data-specific knowledge can be incorporated into existing kernels. The data 
structure for each class can be first found adaptively in the input space via 
agglomerative hierarchical clustering, and a weighted Mahalanobis distance ker- 
nel is then constructed using the detected data distribution information [215]. 
In weighted Mahalanobis distance kernels, the similarity between two pattern 
images is determined not only by the Mahalanobis distance between their cor- 
responding input patterns but also by the sizes of the clusters they reside in. 
Although weighted Mahalanobis distance kernels are not guaranteed to be pos- 
itive definite or conditionally positive definite, satisfactory classification results 
can still be achieved. 


Other methods 


ALMA$_p$ [66] is an incremental learning algorithm which approximates the maximal
margin hyperplane with regard to norm $p \ge 2$ for a set of linearly separable
data. By avoiding QP methods, ALMA$_p$ is very easy to implement and is as
fast as the perceptron algorithm. The accuracy levels achieved by ALMA$_2$ are
superior to those achieved by incremental algorithms such as the perceptron
algorithm and a Euclidean norm based ROMMA algorithm [117], but slightly
inferior to that achieved by SVM. Compared to SVM, the ALMA$_2$ solution is
significantly sparser. On the other hand, ALMA$_2$ is much faster and easier to
implement than SVM training. When learning sparse target vectors (typical in
text processing tasks), ALMA$_p$ with $p > 2$ largely outperforms perceptron-like
algorithms such as ALMA$_2$. ALMA$_2$ operates directly on (an approximation to)
the primal maximal margin problem. ALMA$_p$ is a large margin variant of the
$p$-norm perceptron algorithm.

In recursive SVM [193], several orthogonal directions that best separate the 
data with the maximum margin are obtained. A completely orthogonal basis can 
be derived in the feature subspace spanned by the training samples, and the margin
decreases along the recursive components in linearly separable cases. Compared 
with LDA and regular SVM, the method has no singularity problems and can 
further improve the accuracy. 

Any convex set of kernel matrices is a set of semidefinite programs (SDP). The 
kernel matrix can be learned from data via SDP techniques [109], obtaining a 
method for learning both the model class and the function without local min- 
ima. This approach leads to a convex method to learn the L2-norm soft-margin 
parameter in SVMs. 

A fast SMO procedure [104] solves the dual optimization problem of potential 
SVM. It consists of a sequence of iteration steps in which the Lagrangian is 
optimized with respect to either one (single SMO) or two (dual SMO) of the 
Lagrange multipliers while keeping the other variables fixed. Potential SVM is 
applied using dual SMO, block optimization, and ε-annealing. In contrast to
SVMs, potential SVM is applicable to arbitrary dyadic data sets. Dyadic data 
are based on relationships between objects. For problems that are also solvable 
by standard SVM methods, computation time of potential SVM is comparable 
to or somewhat higher than SVM. The number of support vectors found by 
potential SVM is usually much smaller for the same generalization performance. 

The LASVM algorithm [11] performs SMO during learning. It allows efficient 
online and active learning. In the limit of arbitrarily many epochs, LASVM 
converges to the exact SVM solution [11]. LASVM uses the $L_1$-norm of the slack
variables. In [70], LASVM is considerably improved in learning speed, accuracy 
and sparseness by replacing the working-set selection in the SMO steps. A second- 
order working-set selection strategy, which greedily maximizes the progress in 
each single step, is incorporated. 

Fast local kernel SVM (FaLKM-lib) [169] trains a set of local SVMs on redun- 
dant neighborhoods in the training set and an appropriate model for each query 
point is selected at testing time according to a proximity strategy. The approach 
is scalable for large data sets that are not high-dimensional. It achieves high classification
accuracies by dividing the separation function into local optimization problems. 
The introduction of a fast local model selection further speeds up the learning 
process. The approach has a training time complexity which is sub-quadratic in 
the training set size, and a logarithmic prediction time complexity [169]. 

Decision-tree SVM [24] uses a decision tree to decompose a given data space
and train SVMs on the decomposed regions. For data sets whose size can be han- 
dled by standard kernel-based SVM training techniques, the proposed method 
speeds up the training by a factor of thousands, with comparable test accuracy. 

SVM with automatic confidence (SVMAC) [238] calculates the label confi- 
dence value of each training sample. Thus, the label confidence values of all of 
the training samples can be considered in training SVMs. By incorporating the 
label confidence value of each training sample into learning, the corresponding 
QP problem is derived. The generalization performance of SVMAC is superior
to that of traditional SVMs. In comparison with traditional SVMs, the main
additional cost of training SVMACs is to construct a decision boundary $\gamma$ for
labeling the confidence value of each training sample. 

A two-stage training process for optimizing a kernel function [4] is based on 
the understanding that the kernel mapping induces a Riemannian metric in 
the original input space and that a good kernel should enlarge the separation 
between two classes. This two-stage process is modified in [225] by enlarging the 
kernel by acting directly on the distance measure to the boundary, instead of the 
positions of the support vectors as used before. It is a data-dependent method 
for optimizing the kernel function of SVMs. The algorithm is rather simple, has
low cost, and has only one free parameter.

SVMs with a hybrid kernel can be designed [191] by minimizing the upper 
bound of the VC dimension. This method realizes an SRM and utilizes a flexible 
kernel function such that a superior generalization over test data can be obtained. 
A hybrid kernel is developed using common Mercer kernels. SVM with the hybrid
kernel outperforms that with a single common kernel in terms of generalization 
power. 

DirectSVM [163] is a very simple learning algorithm based on the proposition 
that the two closest training points of opposite class in a training set are support 
vectors, on the condition that the training points in the set are linearly inde- 
pendent. This condition is always satisfied for soft-margin SVMs with quadratic 
penalties. Other support vectors are found using the following conjecture: the 
training point that maximally violates the current hyperplane is also a support 
vector. DirectSVM converges to a maximal margin hyperplane in $M - 2$ iterations,
if the number of support vectors is M. DirectSVM has a generalization
performance similar to other SVM implementations, and is faster than a QP 
approach. 
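
The two ingredients named above, the closest pair of opposite-class points and the search for the maximally violating point, can be sketched as follows. The initial hyperplane is taken as the perpendicular bisector of the closest pair, scaled so that the two points lie at $f(x) = \pm 1$. The function names and the toy data are assumptions of this sketch, and the rule by which the hyperplane is updated once a violator is found is not shown.

```python
def closest_opposite_pair(X, y):
    """Return (squared distance, i, j) for the closest pair with y[i]=+1, y[j]=-1."""
    best = None
    for i, (xi, yi) in enumerate(zip(X, y)):
        for j, (xj, yj) in enumerate(zip(X, y)):
            if yi == 1 and yj == -1:
                d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
                if best is None or d2 < best[0]:
                    best = (d2, i, j)
    return best

def initial_hyperplane(X, y):
    """Perpendicular bisector of the closest pair, scaled to f = +1 and f = -1."""
    d2, i, j = closest_opposite_pair(X, y)
    w = [2.0 * (a - b) / d2 for a, b in zip(X[i], X[j])]
    b = -sum(wk * (a + c) / 2.0 for wk, a, c in zip(w, X[i], X[j]))
    return w, b, (i, j)

def f(x, w, b):
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def max_violator(X, y, w, b):
    """Index of the point with the smallest margin below 1, or None if none."""
    worst, idx = 1.0 - 1e-12, None
    for k, (xk, yk) in enumerate(zip(X, y)):
        m = yk * f(xk, w, b)
        if m < worst:
            worst, idx = m, k
    return idx

X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [0.0, -1.0]]
y = [1, 1, 1, -1, -1, -1]
w, b, pair = initial_hyperplane(X, y)
print("closest pair:", pair, "w =", w, "b =", b)
print("maximal violator:", max_violator(X, y, w, b))
```

On this toy set no point violates the initial hyperplane, so the candidate support vectors are just the closest pair; on other data the violator returned by `max_violator` would be the next candidate.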

Time-adaptive SVM [72] generates adaptive classifiers, capable of learning 
concepts that change with time. It uses a sequence of classifiers, each appropriate 
for a small time window but learning all the hyperplanes in a global way. The 
addition of a new term in the cost function of the set of SVMs (that penalizes 
the diversity between consecutive classifiers) produces a coupling of the sequence 
that allows time-adaptive SVM to learn as a single adaptive classifier. 

Condensed SVM [145] involves integrating the vector combination for SVM
simplification into an incremental framework for working-set selection in SVM 
training. The integration keeps the number of support vectors to the minimum. 
Condensed SVM achieves generalization performance equivalent to that of nor- 
mal SVMs, but is much faster in both the training and testing phases. 

Sparse support vector classification [83] leads to sparse solutions by automatically
setting the irrelevant parameters exactly to zero. It adopts the $L_0$-norm
regularization term and is trained by an iteratively reweighted learning algorithm.
The approach contains a hierarchical-Bayes interpretation. A variation of
the method is equivalent to the $L_0$-norm classifier [223]. The set covering machine
[136] tries to find the sparsest classifier making few training errors, producing 
classifiers having good generalization. 

Methods like core vector machine, ball vector machine [205] and LASVM with 
margin-based active selection and finishing [11] greedily select which basis
vectors to include in the classification rule. They are limited to selecting basis 
vectors from the training set. Basis pursuit methods [98], [212] repeatedly solve 
the optimization problem for a given set of basis vectors, and then greedily 
search for vectors to add or remove. Kernel matching pursuit [212] is an effective 
greedy discriminative sparse kernel classifier that is mainly developed for the LS 
loss function. 

Max-min margin machine [82] is a general large margin classifier. It extends 
SVM by incorporating class structures into the determination of the decision
boundary via the Mahalanobis distance.


Pruning SVMs 


Traditional convex SVM solvers rely on the hinge loss to solve the QP problem. 
Hinge loss imposes no limit on the influence of the outliers. All misclassified 
training instances become support vectors. A theoretical result shows that the 
number n of support vectors grows in proportion to the number of training 
examples [188]. Predicting a new example involves a computational complexity 
of O(n) for n support vectors. It is desirable to build SVMs with a small number 
of support vectors, maintaining the property that their hidden-layer weights are 
a subset of the data (the support vectors). v-SVM [166] and sparse SVMs [98] 
maintain this property. 

The nonconvex ramp loss function can overcome the scalability problems of 
convex SVM solvers. It is amenable to constrained concave-convex procedure 
(CCCP) optimization since it can be decomposed into a difference of convex 
parts. By leveraging the ramp function to prevent outliers from becoming support
vectors, an online learning framework to generate LASVM variants [53]
leads to a significant reduction in the number of wrongly discarded instances, 
and sparser models compared to LASVM [11], without sacrificing generalization 
performance. 
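
The decomposition into a difference of convex parts can be made concrete with a short sketch. Writing $H_a(z) = \max(0, a - z)$ for a hinge with offset $a$ and $z = y f(x)$ for the margin, the ramp loss is taken here as $R_s(z) = H_1(z) - H_s(z)$ with a parameter $s < 1$. The symbols $H_a$, $R_s$, $s$ and the value $s = -1$ are assumptions of this illustration.

```python
def hinge(z, a=1.0):
    """Convex hinge H_a(z) = max(0, a - z), where z = y * f(x) is the margin."""
    return max(0.0, a - z)

def ramp(z, s=-1.0):
    """Ramp loss as a difference of two convex hinges: H_1(z) - H_s(z), s < 1."""
    return hinge(z, 1.0) - hinge(z, s)

for z in (2.0, 0.5, 0.0, -1.0, -5.0):
    print("margin", z, "hinge", hinge(z), "ramp", ramp(z))
```

For a badly misclassified point such as z = -5 the hinge loss keeps growing, whereas the ramp loss is capped at 1 - s. This cap is what bounds the influence of outliers and keeps them from becoming support vectors.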

In order to reduce the number of support vectors, some methods operate as
a post-processing step after standard SVM training. In [170] $L_1$ regularization
is applied on the bias $\theta$ to obtain sparse approximation. A simple but effective
idea of pruning SVMs based on linear dependence is given in [210], and is further 
developed in [50], which gives an exact algorithm to prune the support vector 
set after an SVM classifier is built. In the regression case, the initial support 
vectors are generated using SVMTorch [40] and then those support vectors that 
are identified as linearly dependent are eliminated. 

The pruning algorithm is structured by building into a newly defined kernel 
row space K and is related to feature space H [119]. By analyzing the overlapped 
information of kernel outputs, a method of pruning SVMs to an architecture 
containing at most M support vectors in the $l$-dimensional space H is systematically
developed in [119]. This results in a decrease in the upper bound for
support vectors from $M + 1$ [159] to M while retaining the separating hyperplane.
The method also circumvents the problem of explicitly discerning support
vectors in feature space as the SVM formulation does. In [120], the method in 
[210], [50], [119] is generalized by relaxing linear dependence to orthogonal pro- 
jection using an LS approximation in space K. The support vectors are further 
pruned in batches through a clustering technique. 

To overcome the problem of a large number of support vectors, a primal 
method devised in [98] decouples the idea of basis functions from the concept 
of support vectors; it greedily finds a set of kernel basis functions of a specified 
maximum size ($d_{\max}$) to approximate the SVM primal cost function well; it is
efficient and roughly scales as $O(N d_{\max}^2)$. The method incrementally finds basis
functions (support vectors) to maximize accuracy, starting with an empty set 
of basis functions. In many cases, the method efficiently forms classifiers which 
have an order of magnitude smaller number of basis functions compared to SVM, 
while achieving nearly the same level of accuracy. 

Discarding even a small proportion of the support vectors can lead to a severe 
reduction in generalization performance. There exist non-trivial cases where the 
reduced set approximation is exact, showing that the support vector set deliv- 
ered by SVM is not always minimal [16]. The solution is approximated using a 
reduced set of vectors that are generally not support vectors, but are computed 
from the original support vector set to provide the best approximation to the 
original decision surface. In [142], the reduction process iteratively selects two 
nearest support vectors belonging to the same class and replaces them by a newly 
constructed one. 

Pruning in LS-SVM is investigated using an SMO-based pruning method [234]. 
It requires solving a set of linear equations for pruning each sample, causing a 
large computational cost. Based on a finite-dimensional approximation of the 
feature map, fixed-size LS-SVM is proposed for handling large data sets, leading 
to a sparse model representation [187]. 

Inspired by incremental and decremental learning [18], an adaptive pruning
algorithm for the LS-SVM classifier without solving primal nonsparse LS-SVM
is developed in [230], based on a bottom-to-top strategy. Its training speed is
much faster than SMO for large-scale classification problems without noise.

A pattern selection algorithm based on neighborhood properties [177] selects 
only the patterns that are likely to be located near the decision boundary. A 
neighborhood property is that a pattern located near the decision boundary 
tends to have more heterogeneous neighbors in its class membership. The
well-known entropy measure can be used to quantify the heterogeneity of the
class labels among the k nearest neighbors of a pattern, and this measure in
turn serves as an estimate of the pattern's proximity to the decision boundary.
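
The neighborhood heterogeneity idea can be illustrated with a minimal sketch. It is not the exact algorithm of [177]: the function names, the use of squared Euclidean distance, and the entropy threshold are assumptions made here for illustration only.

```python
import math
from collections import Counter


def neighbor_entropy(points, labels, index, k):
    """Entropy (in bits) of the class labels among the k nearest
    neighbors of points[index] (hypothetical helper, Euclidean metric)."""
    target = points[index]
    dists = []
    for j, p in enumerate(points):
        if j == index:
            continue
        d2 = sum((a - b) ** 2 for a, b in zip(target, p))
        dists.append((d2, j))
    dists.sort()
    neighbor_labels = [labels[j] for _, j in dists[:k]]
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_boundary_patterns(points, labels, k, threshold=0.0):
    """Indices of patterns whose neighborhood entropy exceeds the threshold,
    i.e. patterns with heterogeneous neighbors."""
    return [i for i in range(len(points))
            if neighbor_entropy(points, labels, i, k) > threshold]


# Two classes on a line; only the two patterns next to the class change
# have neighbors from both classes.
pts = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labs = [0, 0, 0, 1, 1, 1]
selected = select_boundary_patterns(pts, labs, k=2)
```

Only the selected patterns would then be passed to SVM training, which is where the reduction in training cost comes from.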

Low-rank modifications to LS-SVM [146] are useful for fast and efficient vari- 
able selection. Recursive feature elimination (RFE) is used for variable selection. 
The method attempts to find the best subset of r input dimensions that leads
to the largest margin of class separation, using an SVM classifier. Relevant
variables are selected according to a closed form of the leave-one-out error estimator,
which is obtained as a by-product of the low-rank modifications. 


Multiclass SVMs 


To solve a multiclass classification problem with SVM, many strategies can be 
adopted. For multiclass SVM, two types of approaches for training and classi- 
fication are mainly applied: to consider all the classes in one big optimization 
problem, or to combine several binary classifiers. 

By considering all the classes in one optimization problem, one creates mul- 
ticlass versions of SVM using the single-machine approach [222], [44]. This 
approach generates a very large optimization problem. The multiclass catego- 
rization problem is cast as a constrained optimization problem with a quadratic 
objective function. An efficient fixed-point algorithm is described for solving this 
reduced optimization problem and its convergence is proved in [44]. In [113],
multiclass smooth SVM is solved by using a fast Newton-Armijo algorithm, which
converges globally and quadratically to the unique solution.

Strategies proposed for decomposing a multiclass problem include
one-against-all (one-versus-rest) [161], [210], [222], one-against-one [106],
all-against-all [106] and error-correcting output codes (ECOC) [47], [3].
Comparative studies of these methods can be found in [161] and [80]. The
one-against-one and one-against-all methods are often recommended because of
their lower computational cost and conceptual simplicity. These strategies are
introduced in Section 20.6.

SVMs with binary tree architecture [32] reduce the number of binary classifiers
and achieve a fast decision. The method needs to train m − 1 classifiers and test
$\log_2 m$ times for the final decision. But to get a good classifier of one node,
it has to evaluate $2^m$ grouping possibilities with m classes in this node. An
architecture named binary tree of SVM [56] achieves high classification efficiency
for multiclass problems. Binary tree of SVM and centered binary tree of SVM
decrease the number of binary classifiers to the greatest extent. In the training
phase, binary tree of SVM has m − 1 binary classifiers in the best situation,
while it has $\log_{4/3}((m + 3)/4)$ binary tests on average when making a decision.
Maintaining comparable accuracy, the average convergent efficiency of binary
tree of SVM is $\log_{4/3}((m + 3)/4)$; it is much faster than directed acyclic graph
SVM (DAG-SVM) [106] and ECOC in problems with a large number of classes.

Figure 16.4 A decision tree for classification. The discriminant functions $D_i(x)$ define the class boundaries.

To resolve unclassifiable regions in the one-against-all strategy, in decision-tree
based SVMs, we train m − 1 SVMs; the ith ($i = 1, \ldots, m - 1$) SVM is trained
so that it separates data of the ith class from data belonging to one of classes
$i + 1, i + 2, \ldots, m$. After training, classification is performed from the first to the
(m − 1)th SVM. If the ith SVM classifies a sample into class i, classification
terminates; otherwise, classification continues until the data sample is classified
into a definite class.
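
The sequential decision procedure described above can be sketched as follows. The trained binary SVMs are represented by plain callables, and the sign convention (a positive output means "class i") as well as the toy one-dimensional decision functions are assumptions made for illustration.

```python
def cascade_classify(x, binary_svms):
    """Classify x with m - 1 binary decision functions applied in order.

    binary_svms[i - 1] is assumed to return a positive value when x is
    assigned to class i and a non-positive value when x belongs to one of
    the classes i + 1, ..., m.
    """
    m = len(binary_svms) + 1
    for i, decide in enumerate(binary_svms, start=1):
        if decide(x) > 0:
            return i          # the ith SVM claims the sample: stop here
    return m                  # rejected by all m - 1 SVMs: last class


# Hypothetical decision functions for m = 4 classes on the real line.
svms = [lambda x: 1.0 - x, lambda x: 2.0 - x, lambda x: 3.0 - x]
predictions = [cascade_classify(x, svms) for x in (0.5, 1.5, 2.5, 3.5)]
```

Because a sample is tested against class 1 first, the earlier classes can claim larger regions, which is why the processing order matters.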

Figure 16.4 shows an example of class boundaries for four classes, when linear 
kernels are used. The classes with smaller class numbers have larger class regions. 
Thus the processing order affects the generalization ability. In a usual decision 
tree, each node separates one set of classes from another set. 


Example 16.4: 

By using STPRtool (http://cmp.felk.cvut.cz/cmp/software/stprtool/),
we implement multiclass SVM classification with both the one-against-all and
one-against-one strategies. The SMO binary solver is used to train the binary
SVM subtasks. A Gaussian kernel with σ = 1 is selected, and C is set to 50. The
training data and the decision boundary are plotted in Fig. 16.5.


Example 16.5: The US Postal Service (USPS) handwritten digit database contains
7291 training and 2007 testing images of 10 handwritten digits with size
16 × 16. The number of features is 256. Some samples of the digits are shown in
Fig. 16.6.

Figure 16.5 Multiclass SVM classification using: (a) one-against-all strategy, (b) one-against-one strategy.

Figure 16.6 Sample digits from the USPS database.

By using LIBSVM, we select C-SVM with the polynomial kernel function
$(\gamma u^T v + c_0)^d$, where C = 110, γ = 1/n, n = 256 is the number of features,
d = 3, and $c_0 = 0$. A pairwise strategy is employed. The testing accuracy for
classification is 95.4659% (1916/2007). The total number of support vectors
for the trained SVM is 1445.


16.8 Support vector regression 


SVM has been extended to regression. The basic idea of SVR is to map the data 
into a higher-dimensional feature space via nonlinear mapping F and then to 
perform linear regression in this space. Regression approximation addresses the 
problem of estimating a function for a given data set $D = \{(x_i, y_i) \mid i = 1, \ldots, N\}$,
where $x_i \in R^n$ and $y_i \in R$. An illustration of SVR is given in Fig. 16.7.

Figure 16.7 Illustration of the hyperplanes in the two-dimensional feature space of SVR. The objective
function penalizes examples whose y values are not within (f(x) − ε, f(x) + ε). Those examples in
circles are support vectors, against which the margin pushes.

The objective is to find a linear regression

$$ f(x) = w^T x + \theta \tag{16.30} $$

such that the regularized risk function is minimized:

$$ \min R(C) = \frac{1}{2}\|w\|^2 + C \sum_{p=1}^{N} \|y_p - f(x_p)\|_\varepsilon, \tag{16.31} $$

where the ε-insensitive loss function $\|\cdot\|_\varepsilon$ is used to define an empirical risk
functional, defined by

$$ \|x\|_\varepsilon = \max\{0, \|x\| - \varepsilon\}, \tag{16.32} $$

and ε > 0 and the regularization constant C > 0 are prespecified. Other robust
statistics based loss functions such as Huber’s function can also be used. If a data
point $x_p$ lies inside the insensitive zone called the ε-tube, i.e., $|y_p - f(x_p)| < \varepsilon$,
then it will not incur any loss.
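
As a small numerical illustration of (16.31) and (16.32), the following sketch evaluates the ε-insensitive loss and the regularized risk of a given linear model. The function names and the toy data are assumptions; nothing is optimized here, the risk is only evaluated.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Loss (16.32) for a scalar target: zero inside the eps-tube."""
    return max(0.0, abs(y - f_x) - eps)


def regularized_risk(w, theta, xs, ys, C, eps):
    """Risk (16.31) for the linear model f(x) = w^T x + theta."""
    flatness = 0.5 * sum(wi * wi for wi in w)
    empirical = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(wi * xi for wi, xi in zip(w, x)) + theta
        empirical += eps_insensitive_loss(y, f_x, eps)
    return flatness + C * empirical


# Toy data: the residuals of f(x) = x are 0, 0.5 and 0.05.
xs = [[0.0], [1.0], [2.0]]
ys = [0.0, 1.5, 2.05]
risk = regularized_risk([1.0], 0.0, xs, ys, C=10.0, eps=0.1)
```

Only the second sample lies outside the tube (its loss is 0.5 − 0.1 = 0.4), so the risk is 0.5 + 10 × 0.4 = 4.5.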

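By introducing slack variables $\xi_p$ and $\zeta_p$, the optimization problem (16.31) can
be transformed into a QP problem [209]:

$$ \min R(w, \xi, \zeta) = \frac{1}{2}\|w\|^2 + C \sum_{p=1}^{N} (\xi_p + \zeta_p) \tag{16.33} $$

subject to

$$ (w^T x_p + \theta) - y_p \le \varepsilon + \xi_p, \quad p = 1, \ldots, N, \tag{16.34} $$
$$ y_p - (w^T x_p + \theta) \le \varepsilon + \zeta_p, \quad p = 1, \ldots, N, \tag{16.35} $$
$$ \xi_p \ge 0, \quad \zeta_p \ge 0, \quad p = 1, \ldots, N. \tag{16.36} $$

When the error is smaller than ε, the slack variables $\xi_p$ and $\zeta_p$ take zero. The
first term, $\frac{1}{2}\|w\|^2$, is used as a measurement of function flatness.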

By replacing x by $\phi(x)$, linear regression is generalized to kernel-based
regression estimation, and regression is performed in the kernel space. Define the
kernel function that satisfies Mercer’s condition, $k(x, y) = \phi^T(x)\phi(y)$.
Applying the Lagrange multiplier method, we get the following optimization problem
[209, 210]:

$$ \min L(\alpha, \beta) = \frac{1}{2} \sum_{p=1}^{N} \sum_{i=1}^{N} (\alpha_p - \beta_p)(\alpha_i - \beta_i) k(x_p, x_i) + \sum_{p=1}^{N} (\alpha_p - \beta_p) y_p + \sum_{p=1}^{N} (\alpha_p + \beta_p) \varepsilon \tag{16.37} $$

subject to

$$ \sum_{p=1}^{N} (\alpha_p - \beta_p) = 0, \tag{16.38} $$
$$ 0 \le \alpha_p, \beta_p \le C, \quad p = 1, \ldots, N, \tag{16.39} $$

where the Lagrange multipliers $\alpha_p$ and $\beta_p$, respectively, correspond to (16.34)
and (16.35).
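
The step from the primal problem (16.33)-(16.36) to the dual problem (16.37)-(16.39) can be sketched as follows. The multipliers $\eta_p$ and $\eta_p'$ for the nonnegativity constraints (16.36) are notation introduced here for illustration and are not taken from [209], [210]. With x replaced by $\phi(x)$, the Lagrangian is

$$ \begin{aligned} L ={}& \frac{1}{2}\|w\|^2 + C \sum_{p=1}^{N} (\xi_p + \zeta_p) - \sum_{p=1}^{N} \alpha_p \left( \varepsilon + \xi_p + y_p - w^T \phi(x_p) - \theta \right) \\ & - \sum_{p=1}^{N} \beta_p \left( \varepsilon + \zeta_p - y_p + w^T \phi(x_p) + \theta \right) - \sum_{p=1}^{N} \left( \eta_p \xi_p + \eta_p' \zeta_p \right). \end{aligned} $$

Setting the derivatives with respect to the primal variables to zero gives

$$ \begin{aligned} \frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{p=1}^{N} (\beta_p - \alpha_p) \phi(x_p), \\ \frac{\partial L}{\partial \theta} = 0 &\;\Rightarrow\; \sum_{p=1}^{N} (\alpha_p - \beta_p) = 0, \\ \frac{\partial L}{\partial \xi_p} = 0 &\;\Rightarrow\; \alpha_p = C - \eta_p \le C, \\ \frac{\partial L}{\partial \zeta_p} = 0 &\;\Rightarrow\; \beta_p = C - \eta_p' \le C. \end{aligned} $$

Substituting these conditions back into the Lagrangian eliminates w, θ and the slack variables; changing the sign of the resulting maximization problem yields the minimization (16.37). The expansion of w also explains why the weights $\beta_p - \alpha_p$ appear in the regression output (16.40).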
The SVM output generates the regression

$$ u(x) = f(x) = \sum_{p=1}^{N} (\beta_p - \alpha_p) k(x_p, x) + \theta, \tag{16.40} $$

where θ can be solved using the boundary conditions. Those vectors with
$\alpha_p - \beta_p \ne 0$ are called support vectors. SVR has the sparseness property.
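
Evaluating the regression output (16.40) from a given dual solution can be sketched as below. The multiplier values, the bias, and the Gaussian kernel parameterization $\exp(-\|u - v\|^2 / (2\sigma^2))$ are assumptions chosen for illustration; in practice the multipliers come from solving (16.37)-(16.39).

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel; this parameterization is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def support_vector_indices(alpha, beta):
    """Indices p with alpha_p - beta_p != 0."""
    return [p for p, (a, b) in enumerate(zip(alpha, beta)) if a != b]


def svr_predict(x, train_x, alpha, beta, theta, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_p (beta_p - alpha_p) k(x_p, x) + theta.
    Only support vectors contribute to the sum."""
    total = theta
    for p in support_vector_indices(alpha, beta):
        total += (beta[p] - alpha[p]) * kernel(train_x[p], x)
    return total


# Hypothetical dual solution with two support vectors out of three samples.
train_x = [(0.0,), (1.0,), (2.0,)]
alpha = [0.0, 0.0, 0.5]
beta = [1.0, 0.0, 0.0]
value = svr_predict((0.0,), train_x, alpha, beta, theta=0.25)
```

The middle sample has zero weight and is never touched at prediction time, which is the practical meaning of the sparseness property.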

The above ε-SVR model is formulated as a convex QP problem. Solving a
QP problem needs $O(N^2)$ memory and time resources. The idea of LS-SVM has
been extended for SVR [186].

SVR is a robust method due to the introduction of the ε-insensitive loss
function. Varying ε influences the number of support vectors and thus controls the
complexity of the model. The choice of ε reduces many of the weights $\alpha_p - \beta_p$
to zero, leading to a sparse solution in (16.40). Kernel selection is
application-specific. Bounded influence SVR [52] downweights the influence of outliers in all
the regression variables. It adopts an adaptive weighting strategy, which is based 
on a robust adaptive scale estimator for large regression residuals. 

The performance of SVR is sensitive to the hyperparameters, and underfitting
or overfitting occurs when the hyperparameters are not chosen properly. To
overcome the difficulty of selecting ε, the ν-SVR model [166], [21] automatically
adjusts the width of the tube so that at most a fraction ν of the data points lie
outside the tube. ν-SVR is a batch learning algorithm. Through an approximation
of the SVR model, parameter C can be dropped by considering its relation with
the rest of the SVR hyperparameters (γ and ε). Bounds for γ and ε are obtained
in [149] for the Gaussian kernel function.

Other formulations of the SVR problem that minimize the $L_1$-norm of the
parameters can be derived to yield an LP problem [179], leading to the sparsity
of support vectors or the ability to use more general kernels. In [22], various 
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leave-one-out bounds for SVR are derived and the difference from those for clas- 
sification is discussed. The proposed bounds are competitive with Bayesian SVR 
for parameter selection. 

Regularization path algorithms [217] explore the path of possibly all solutions
with respect to some regularization hyperparameter for model selection in an
efficient way. For ε-SVR, ε is required to be set a priori. An ε-path algorithm
possesses the desirable piecewise linearity property. It possesses competitive
advantages over the λ-path algorithm [73], which computes the entire solution
path of SVR. An unbiased estimate for the degrees of freedom of the SVR model
allows convenient selection of the regularization parameter [73]. The ε-path
algorithm has a very simple initialization step and is efficient in finding a good
regression function with the desirable sparseness property that can generalize
well. It initializes the tube width ε to infinity, implying that it starts with no
support vectors. It then reduces ε so that the number of support vectors
increases gradually.

ε-smooth SVR [114] applies the smoothing technique for smooth SVM [113] to
replace the ε-insensitive loss function by an accurate smooth approximation so as
to solve ε-SVR as an unconstrained minimization problem by using the
Newton-Armijo method. For ε-smooth SVR, only a system of linear equations needs
to be solved iteratively. In the linear case, ε-smooth SVR is much faster than SVR
implemented by LIBSVM and SVMlight, with comparable correctness.

SVR is formulated as a convex QP problem with pairs of variables. Some
SVR-oriented SMO algorithms make use of the close relationship between $\alpha_i$
and $\alpha_i^*$, the two Lagrange multipliers associated with the ith sample. In [181],
the method selects two pairs of variables (or four variables), $\alpha_i$, $\alpha_i^*$, $\alpha_j$
and $\alpha_j^*$, at each step according to a strategy similar to SMO, and
solves the QP subproblem with respect to the selected variables analytically.
The method for updating the bias is inefficient, and some improvements are
made in [176] based on SMO for classification problems. Nodelib [59] includes
some enhancements to SMO, where $\alpha_i$ and $\alpha_i^*$ are selected simultaneously. The
QP problems with 2l variables can be transformed into nonsmooth optimization
problems with l variables $\beta_i = \alpha_i - \alpha_i^*$, i = 1, 2, ..., l, and an SMO algorithm
solves these nonsmooth optimization problems.

SVMTorch [40] is a decomposition algorithm for regression problems, which is 
similar to SVMlight for classification problems. A convergence proof exists for 
SVMTorch [40]. SVMTorch selects $\alpha_i$ independently of their counterparts $\alpha_i^*$.
SVMTorch is usually many times faster than Nodelib [59], and training time
generally scales slightly less than O(N). Subproblems of size 2 are solved
analytically, as is done in SMO. A cache keeping part of the kernel matrix enables the
program to solve large problems without keeping quadratic resources in memory 
and without recomputing every kernel evaluation. 

The global convergence of general SMO algorithms for SVR is established in [190]
based on the formulation given in [59]. By using the same approach as in [100], 
the algorithm is proved to reach an optimal solution within a finite number of 
iterations if two conditions are satisfied [190]. 



Loss functions for regression problems are derived by symmetrization of 
margin-based losses commonly used in boosting algorithms, namely, the logis- 
tic loss and the exponential loss [46]. The resulting symmetric logistic loss can 
be viewed as a smooth approximation to the ε-insensitive hinge loss used in
SVR. Both batch and online algorithms are presented for solving the resulting 
regression problems [46]. 

SVR lacks the flexibility to capture the local trend of data. Localized SVR 
[229] adapts the margin locally and flexibly, while the margin in SVR is fixed 
globally. It can be regarded as the regression extension of the max-min margin 
machine [82]. The associated optimization of localized SVR can be relaxed as 
a second-order cone programming (SOCP) problem, whose globally optimal
solution can be attained in polynomial time. Kernelization is applicable to the localized
SVR model. 

In the spirit of twin SVM [88], twin SVR [154] aims at generating a pair of
nonparallel ε-insensitive down- and up-bound functions for the unknown
regressor. It only needs to solve two smaller-sized QP problems instead of the large
one as in classical SVR, thus making twin SVR work faster than SVR, with
comparable generalization. By introducing a quadratic function to approximate
its loss function, primal twin SVR [155] directly optimizes the pair of QP
problems of twin SVR in the primal space based on a series of sets of linear equations.
Primal twin SVR significantly improves the learning speed of twin SVR without
loss of generalization.

A recursive finite Newton method for nonlinear SVR in the primal is presented 
in [10] and it is comparable with dual optimizing methods like LIBSVM 2.82. A 
non-convex loss function for SVR is proposed in [237] in the primal, and then 
the concave-convex procedure is utilized to transform the non-convex
optimization into a convex one. A Newton-type optimization algorithm is developed, which
can not only retain the sparseness of SVR but also suppress outliers in the
training examples. In addition, its computational complexity is comparable with the
existing SVR with convex loss function in the primal. 

Support vector ordinal regression [175] attempts to find an optimal mapping
direction w and r − 1 thresholds, $b_1, \ldots, b_{r-1}$, which define r − 1 parallel
discriminant hyperplanes for the r ranks accordingly. This formulation is improved in
[38] by including the ordinal inequalities on the thresholds $b_1 \le b_2 \le \cdots \le b_{r-1}$,
and a good generalization performance is achieved with an SMO-type algorithm.
Support vector ordinal regression approaches proposed in [38] optimize multiple
thresholds to define parallel discriminant hyperplanes for the ordinal scales. The
size of these optimization problems is O(N).
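
How a direction w and r − 1 ordered thresholds partition the projection axis into r ranks can be sketched as follows. The decision convention used here (rank j when $b_{j-1} < w^T x \le b_j$) and the numerical values are assumptions for illustration; training of w and the thresholds is not shown.

```python
def predict_rank(x, w, thresholds):
    """Return a rank in 1..r for pattern x, given a mapping direction w and
    ordered thresholds b_1 <= ... <= b_{r-1} (assumed already trained)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    rank = 1
    for b in thresholds:
        if score > b:         # the projection lies beyond this hyperplane
            rank += 1
    return rank


# Hypothetical model with r = 4 ranks.
w = [1.0, 1.0]
b = [-1.0, 0.0, 2.0]
ranks = [predict_rank(x, w, b)
         for x in [(-2.0, 0.0), (-0.5, 0.0), (0.5, 0.0), (3.0, 0.0)]]
```

Because the hyperplanes share the same w, they are parallel, and ordered thresholds ensure that the rank regions appear in order along the projection axis.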


Example 16.6: Consider a function

$$ y = x + 0.5\exp(-10x^2) + 0.1N, \quad x \in [-1, 1], $$

where N is added Gaussian noise with mean 0 and variance 0.2. We generate
101 training data from the equation. We use LIBSVM and select ε-SVR for
regression. By selecting C = 100, γ = 100 and the RBF kernel $\exp(-\gamma |u - v|^2)$, the
obtained MSE is 0.0104 and the corresponding number of support vectors is
38. The result is shown in Fig. 16.8.

Figure 16.8 Approximation of the samples.


16.9 Support vector clustering


Support vector clustering [8] uses a Gaussian kernel to transform the data points 
into a high-dimensional feature space. Clustering is conducted in the feature 
space and is then mapped back to the data space. The approach attempts to 
find in the feature space the smallest sphere of radius R that encloses all the 
data points in a set $\{x_p\}$ of size N. It can be described by minimizing

$$ E(R, \xi) = R^2 + C \sum_{p=1}^{N} \xi_p \tag{16.41} $$

subject to

$$ \|\phi(x_p) - c\|^2 \le R^2 + \xi_p, \quad p = 1, \ldots, N, \tag{16.42} $$
$$ \xi_p \ge 0, \quad p = 1, \ldots, N, \tag{16.43} $$

where $\phi(\cdot)$ maps a pattern onto the feature space, $\xi_p$ is a slack variable for the
pth data point, c is the center of the enclosing sphere, and C is a penalty constant
controlling the noise.
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Based on the Lagrange multiplier method, the problem is transformed into 


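$$ \min E_{\mathrm{SVC}} = \sum_{p=1}^{N} \sum_{i=1}^{N} \alpha_p \alpha_i k(x_p, x_i) - \sum_{p=1}^{N} \alpha_p k(x_p, x_p) \tag{16.44} $$

subject to

$$ \sum_{p=1}^{N} \alpha_p = 1, \tag{16.45} $$
$$ 0 \le \alpha_p \le C, \quad p = 1, \ldots, N, \tag{16.46} $$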


where k(·) is selected as the Gaussian kernel and $\alpha_p$ is the Lagrange multiplier
corresponding to the pth data point. The width σ of the Gaussian kernel controls
the cluster scale while the soft margin $\xi_p$ helps in coping with the outliers and
overlapping clusters. By varying $\alpha_p$ and $\xi_p$, support vector clustering maintains a
minimal number of support vectors so as to generate smooth cluster boundaries
of arbitrary shape.

The distance between the mapping of an input pattern and the spherical center 
can be computed as 


$$ d^2(x, c) = \|\phi(x) - c\|^2 = k(x, x) - 2 \sum_{p=1}^{N} \alpha_p k(x_p, x) + \sum_{p=1}^{N} \sum_{i=1}^{N} \alpha_p \alpha_i k(x_p, x_i). \tag{16.47} $$

Those data points that are on the boundary of the contours are support vectors. 
A support function is defined as a positive scalar function $f : R^n \to R^+$, where
a level set of f estimates a support of a data distribution. The level set of f can
normally be decomposed into several disjoint connected sets

$$ L_f(r) = \{x \in R^n : f(x) \le r\} = C_1 \cup \cdots \cup C_m, \tag{16.48} $$

where $C_i$ are disjoint connected sets corresponding to different clusters and m is
the number of clusters determined by f. A support function in support vector 
clustering is generated by the SVDD method (or one-class SVM) [194], [168]. 
SVDD maps data points to a high-dimensional feature space and finds a sphere 
with minimal radius that contains most of the mapped data points in the feature 
space. This sphere, when mapped back to the data space, can separate into 
several components, each enclosing a separate cluster of points. 
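
The squared feature-space distance (16.47) can be evaluated with kernel calls only, as the following sketch shows. The multipliers are taken as given (they would normally come from solving the dual problem), and the Gaussian kernel parameterization $\exp(-\|u - v\|^2 / (2\sigma^2))$ is an assumption.

```python
import math


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel; this parameterization is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def squared_distance_to_center(x, data, alpha, sigma):
    """d^2(x, c) as in (16.47), with the center c = sum_p alpha_p phi(x_p)."""
    k_xx = gaussian_kernel(x, x, sigma)   # equals 1 for the Gaussian kernel
    cross = sum(a_p * gaussian_kernel(x_p, x, sigma)
                for x_p, a_p in zip(data, alpha))
    # The double sum does not depend on x and could be cached across calls.
    quad = sum(a_p * a_i * gaussian_kernel(x_p, x_i, sigma)
               for x_p, a_p in zip(data, alpha)
               for x_i, a_i in zip(data, alpha))
    return k_xx - 2.0 * cross + quad


# Hypothetical multipliers: two points with equal weight (they sum to 1).
data = [(0.0,), (2.0,)]
alpha = [0.5, 0.5]
d2_mid = squared_distance_to_center((1.0,), data, alpha, sigma=1.0)
d2_far = squared_distance_to_center((50.0,), data, alpha, sigma=1.0)
```

Points far from all data have a larger distance to the center than points between the data, so thresholding $d^2(x, c)$ at $R^2$ produces the enclosing contours in data space.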

The support vector clustering algorithm consists in general of two main steps:
an SVM training step to estimate a support function and a cluster labeling step to
assign each data point to its corresponding cluster. The time complexity of the
cluster labeling step is $O(N^2 m)$ for N data points and m (≪ N) sampling
points on each edge. Support vector clustering has the ability to generate cluster
boundaries of arbitrary shape and to deal with outliers. By utilizing the concept
of dynamical consistency, labeling time can be reduced to O(log N) [95].

Figure 16.9 Clustering of a data set using support vector clustering with C = 1 and σ = 0.1443.
©Ben-Hur, Figure 1d of [8].
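
The cluster labeling step with m sampling points per edge can be sketched as follows. The membership test `inside` is a hypothetical stand-in for checking whether a point lies within the enclosing contour (for example, whether its feature-space distance to the sphere center is at most the radius); here it is replaced by a simple geometric test so that the example is self-contained.

```python
def label_clusters(points, inside, m=10):
    """Assign a cluster label to each point.

    Two points are connected when all m sampling points on the segment
    between them satisfy inside(z). Clusters are the connected components
    of the resulting graph. This costs O(N^2 m) calls to inside().
    """
    n = len(points)

    def connected(p, q):
        for s in range(1, m + 1):
            t = s / (m + 1)
            z = tuple(a + t * (b - a) for a, b in zip(p, q))
            if not inside(z):
                return False
        return True

    adjacency = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if connected(points[i], points[j]):
                adjacency[i].append(j)
                adjacency[j].append(i)

    # Depth-first search over the graph to extract connected components.
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


def inside(z):
    """Hypothetical support region: two unit disks centered at (0, 0), (5, 0)."""
    return any((z[0] - cx) ** 2 + z[1] ** 2 <= 1.0 for cx in (0.0, 5.0))


pts = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0), (5.0, 0.5), (0.0, 0.5)]
labels = label_clusters(pts, inside)
```

A segment joining points of different clusters leaves the support region at some sampling point, so no edge is created between them.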


A heuristic rule is used to determine the width parameters of the Gaussian 
kernels and the soft-margin constant. Support vector clustering can be stopped 
when the fraction of support vectors and bounded support vectors exceeds a cer- 
tain threshold (approximately 10% of the data points) [8]. Multisphere support 
vector clustering [34] creates multiple spheres to adaptively represent individual 
clusters. It is an adaptive cell-growing method, which essentially identifies dense 
regions in the data space by finding the corresponding spheres with minimal 
radius in the feature space. It can obtain cluster prototypes as well as cluster 
memberships. 

A cluster validity measure using a ratio of cluster compactness to separation is 
proposed with outlier detection and cluster merging algorithms for support vec- 
tor clustering [216]. The validity measure can automatically determine suitable 
values for the kernel parameter and soft-margin constant as well. An outlier max- 
mean distance ratio is defined as a criterion for distinguishing support vectors 
from outliers. 


Example 16.7: This example is taken from [8]. The data set has 183
samples. The clusters can be separated by selecting σ = 0.1443 and C = 1.
Support vector clustering gives the result in Fig. 16.9, where support vectors are
indicated by small circles, and clusters are represented by different colors (gray 
scales) of the samples. 


Maximum-margin clustering [228] performs clustering by simultaneously finding
the largest-margin separating hyperplane between clusters. The optimization
problem is a nonconvex integer programming problem, which can be reformulated
and relaxed as a semidefinite program (SDP). Alternating optimization can also
be performed on the original nonconvex problem [236]. A key step to avoid
premature convergence in the resultant iterative procedure is to change the loss
function from the hinge loss to the Laplacian/square loss to penalize
overconfident predictions, leading to a method that is more accurate and two to
four orders of magnitude faster. Generalized maximum-margin clustering [207]
reduces the scale of the original SDP problem significantly (from $O(N^2)$ to O(N)).

A cutting-plane maximum margin clustering algorithm [219] first decomposes
the nonconvex maximum-margin clustering problem into a series of convex
subproblems by using the concave-convex procedure (CCCP), and then adopts the
cutting-plane algorithm for solving convex programs to solve each subproblem.
It outperforms maximum-margin clustering, both in efficiency and accuracy. The 
algorithm takes O(sN) time to converge with guaranteed accuracy, for N sam- 
ples in the data set and the sparsity s of the data set, i.e., the average number 
of nonzero features of the data samples. The multiclass version of the algorithm 
is also derived in [219]. 

Weighted support vector C-means clustering [218] is a hierarchical method
for designing a binary hierarchical classification structure. Each node in
the hierarchy uses a support vector representation and discrimination machine 
classifier, which is a version of SVM that provides good discrimination between 
the true classes and better rejection for false classes. 


Distributed and parallel SVMs 


An increasing number of databases (such as weather, oceanographic, remote
sensing, financial) are coming online and are distributed. Distributed processing
naturally emerges when data are acquired in many places with different owners
and data privacy concerns arise. A distributed SVM algorithm assumes that the
training data come from the same distribution and are stored locally at different
locations with processing capabilities (nodes). A reasonably small amount of
information is interchanged among nodes to obtain an SVM solution, which is
comparable to, but slightly worse than, that of the centralized approach.

Distributed SVMs and parallel SVMs emphasize global optimality. A distributed
SVM algorithm can find support vectors locally and process them all together
in a central processing center. The solution is not globally optimal. This
method can be improved by allowing the data processing center to send support 
vectors back to the distributed data source and iteratively achieve the global 
optimum. This model is slow due to extensive data accumulation in each site. 
Another procedure is a sort of distributed chunking technique, where support 
vectors local to each node are exchanged with the other nodes, the resulting 
optimization subproblems are solved at each node, and the procedure is repeated 
until convergence. 

Two distributed schemes are analyzed in [141]: a naive distributed chunking 
approach, where support vectors are communicated, and the distributed semi- 
parametric SVM, which further reduces the total amount of information passed 
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between nodes. The naive distributed chunking approach is simple to implement 
and has a performance slightly better than that of distributed semiparametric 
SVM. In the distributed semiparametric SVM, no raw data are exchanged. By 
selecting centroids as training patterns plus noise, privacy can be preserved. 

Parallel SVMs are effective for implementation on multiprocessor systems. 
Examples are the matrix multicoloring successive overrelaxation method [131] 
and the variable projection method in SMO [232]. A parallel implementation of 
SVMlight [232] splits the QP problem into smaller subproblems, which are then 
solved by a variable projection method. However, these methods need central- 
ized access to the training data, and therefore, cannot be used in distributed 
classification applications. 

Distributed parallel SVM [129] is implemented for distributed data classifi- 
cation in a general network configuration, namely, strongly connected network. 
Support vectors carry all the classification information of the local data set. Each 
site within a strongly connected network classifies subsets of training data locally 
via SVM, passes the calculated support vectors to its descendant sites, receives 
support vectors from its ancestor sites, recalculates the support vectors, passes 
them to its descendant sites, and so on. SVMlight is used as the local solver. Dis- 
tributed parallel SVM is able to work on multiple arbitrarily partitioned working 
sets and achieve close to linear scalability if the size of the network is not too 
large. The algorithm is proved to converge to a globally optimal classifier (at 
every site) for arbitrarily distributed data over a strongly connected network in 
finite steps. 

In [60], the centralized linear SVM problem is cast as a set of decentralized 
convex optimization subproblems, one per node, with consensus constraints on 
the desired classifier parameters. Using the alternating direction method of
multipliers, fully distributed training algorithms are obtained without exchanging
training data among nodes. The overhead associated with inter-node communi- 
cations is fixed and solely dependent on the network topology rather than the 
size of the training sets available per node. 

A parallel gradient projection based decomposition technique [233] is imple- 
mented based on both the gradient projection QP solvers and the selection rules 
for large working sets. The software implements an iterative decomposition tech- 
nique and exploits both the storage and computing resources available on mul- 
tiprocessor systems. 

HeroSVM [49] uses a block-diagonal approximation of the kernel matrix to 
derive hundreds of independent small SVMs and filter out the examples which 
are estimated to be non-support vectors; then a new serial SVM is trained on the 
collected support vectors. A parallel optimization step is introduced to quickly 
remove most of the non-support vectors. In addition, some effective strategies 
such as kernel caching and efficient computation of kernel matrix are integrated 
to speed up the training process. The algorithm complexity grows linearly with 
the number of classes and size of the data set. It has a much better scaling 
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capability than LIBSVM, SVMlight and SVMTorch, with good generalization 
performance. 


SVMs for one-class classification 


Information retrieval using only positive examples for training is important in 
many applications. Consider trying to classify sites of interest to a web surfer 
where the only information available is the history of the user’s activities. Novelty
detection is the identification of novel patterns for which the learning system has
been trained with only a few samples. This problem arises when novel or abnormal
examples are expensive or difficult to obtain. Novelty detection usually arises in
the context of imbalanced data sets. With samples from both novel and normal
patterns, novelty detection can be viewed as a usual binary classification problem. 
The purpose of data description, also called one-class classification, is to give a 
compact description of the target data that represents most of its characteristics. 

One-class SVM [8], [194], [168] is a kernel method based on a support vector 
description of a data set consisting of positive examples only. The two well- 
known approaches to one-class classification are separation of data points from 
the origin [168] and spanning of data points with a sphere of minimum volume 
[194], [197]. The first approach is to extract a hyperplane in a kernel feature space 
such that a given fraction of training objects may reside beyond the hyperplane, 
while at the same time the hyperplane has maximal distance to the origin [168]. 
After transforming the features via a kernel, the origin is treated as the only
member of the second class using relaxation parameters, and standard two-class
SVMs are then employed [168]. Both approaches lead to similar, and in
certain cases such as the Gaussian kernel function, even identical formulations of
dual optimization problems. If all data are inliers, one-class SVM computes the
smallest sphere in feature space enclosing the image of the input data. 

In SVDD [195], [197], the compact description of target data is given in a 
hyperspherical model, which is determined by support vectors as a hypersphere 
(a, R) with minimum volume containing most of the target data. SVDD is
limited in its ability to reflect the overall characteristics of a target data set
with respect to its density distribution. In SVDD, support vectors fully determine the solution
of target data description, whereas all of the non-support vectors have no influ- 
ence on the solution of target description, regardless of the density distribution. 
The kernel trick is utilized to find a more flexible data description in a high- 
dimensional feature space [197]. To address the problem in SVDD, a density- 
induced SVDD [116] reflects the density distribution of a target data set by 
introducing the notion of a relative density degree for each data point. By using 
density-induced distance measurements for both target data and negative data, 
density-induced SVDD can shift the center of hypersphere to the denser region 
based on the assumption that there are more data points in a denser region.
When information of negative data is available, the method has a performance
comparable to that of k-NN and SVM.
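
As a point of reference, the minimum-volume hypersphere of SVDD is usually
stated as the problem below. This is a sketch in assumed notation: the slack
variables $\xi_i$, the tradeoff constant $C$ and the feature map
$\boldsymbol{\phi}$ do not appear in the text above.

$$\min_{R, \boldsymbol{a}, \boldsymbol{\xi}} \; R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \|\boldsymbol{\phi}(\boldsymbol{x}_i) - \boldsymbol{a}\|^2 \leq R^2 + \xi_i, \;\; \xi_i \geq 0, \;\; i = 1, \ldots, N.$$

Only points lying on or outside the sphere receive nonzero Lagrange multipliers,
which is why the support vectors alone determine $(\boldsymbol{a}, R)$ and why
the density of the interior points is ignored.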

Outlier-SVM [130] is based on identifying outliers as representative of the 
second class. In the context of information retrieval, one-class SVM [168] and 
outlier-SVM [130] outperform the prototype, nearest neighbors, and naive Bayes 
methods. While one-class SVM is more robust with regard to smaller categories, 
a one-class neural network method based on bottleneck compression generated 
filters and outlier-SVM give good results by emphasizing success in the larger 
categories. 

By reformulating a standard one-class SVM [168], LS one-class SVM [35] is 
derived with a reformulation very similar to that of LS-SVM. It extracts a hyper- 
plane as an optimal description of training objects in a regularized LS sense. 
LS one-class SVM uses a quadratic loss function and equality constraints, and 
extracts a hyperplane with respect to which the distances from training objects 
are minimized in a regularized LS sense. Like LS-SVM, LS one-class SVM loses 
the sparseness property of standard one-class SVMs. One may overcome the loss 
of the sparseness by pruning the training samples. 


Incremental SVMs 


When samples arrive sequentially, incremental learning is promising for one-class 
classification and active learning. Some incremental learning techniques for SVM 
are given in [63], [18], [117], [196], [137], [143], [213], [203], [158]. Incremental 
SVMs are more efficient than batch SVMs in terms of computational cost. 

Exact incremental SVM learning (http://www.cpdiehl.org/code.html) [18]
updates an optimal solution of an SVM training problem at each step only after
a training example is added (or removed). It offers the advantage of immediate
availability of the exact solution and reversibility, but has a large memory
requirement, since the set of support vectors must be retained in memory during
the entire learning process. Incremental SVM uses the L1-norm of the slack
variables. Based on an analysis of convergence and algorithmic complexity of
exact incremental SVM learning [18], a design using the gaxpy-type updates of
the sensitivity vector speeds up the training of an incremental SVM by a factor
of 5 to 20 [110].

Following exact incremental SVM learning [18], accurate online SVR [143] 
efficiently updates a trained SVR function whenever a sample is added to or 
removed from the training set. The accurate online SVR technique assumes that 
the new samples and the training samples are of the same characteristics.
Accurate online SVR with varying parameters [147] uses varying SVR parameters
rather than fixed ones and hence accounts for the variability that may exist in
the samples. Examples of adaptive learning algorithms include various improved
versions of SMO for SVR [176, 59].

Incremental asymmetric proximal SVM (IAPSVM) [160] employs a greedy
search across the training data to select the basis vectors of the classifier, and 
tunes parameters automatically using the simultaneous perturbation stochastic 
approximation after incremental additions are made. The greedy search strat- 
egy substantially improves the accuracy of the resulting classifier compared to 
reduced-set methods introduced by proximal SVM. IAPSVM compares favorably 
with SVMTorch [40] and core vector machine at reduced complexity levels. 

Kernel Adatron [63] adapts Adatron to the problem of maximum-margin clas- 
sification with kernels. An active set approach to incremental SVM [178] uses a 
warm-start algorithm for training, which takes advantage of natural incremental 
properties of standard active set approach to linearly constrained optimization 
problems. In an online algorithm for L1-SVM [11], a close approximation of
the exact solution is built online. This algorithm scales well to several hundred
thousand examples; however, its online solution is not as accurate as the exact
solution.

The classical perceptron algorithm with margin is a member of a broader family
of large margin classifiers collectively called the margitron [153]. The margitron, 
sharing the same update rule with the perceptron, is shown in an incremental 
setting to converge in a finite number of updates to solutions possessing any 
desirable fraction of the maximum margin. 
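
The shared update rule can be illustrated with a minimal sketch of the classical
perceptron with margin (not the margitron of [153] itself). The function name,
the margin threshold `beta`, the learning rate `eta` and the toy data are all
illustrative assumptions.

```python
def perceptron_with_margin(samples, labels, beta=1.0, eta=1.0, max_epochs=1000):
    """Train a linear classifier, updating whenever y * (w.x + b) <= beta."""
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= beta:              # margin condition violated
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                updated = True
        if not updated:                        # all samples exceed the margin
            break
    return w, b


# Hypothetical linearly separable toy data.
X = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (-1.0, -3.0)]
y = [1, 1, -1, -1]
w, b = perceptron_with_margin(X, y)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x in X]
```

Setting `beta` to zero recovers the plain perceptron condition; a positive
`beta` forces further updates until every training sample clears the threshold.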

Core vector machine [203] is based on L2-SVM and scales to several million
examples. It approximates a solution to L2-SVM by a solution to the two-class
minimum enclosing ball problem, for which several efficient online algorithms are 
available. While its scalability is very impressive, the method can lead to higher 
test errors. 

In an improved incremental algorithm for SVM [31], the training set is divided 
into groups and C-means clustering is used to collect the initial set of training 
samples. A weight is assigned to each sample in terms of its distance to the 
separating hyperplane and the confidence factor. In [97], the incremental training 
method uses one-class SVM. A hypersphere is generated for each class, and data
points that lie near the boundary of the hypersphere are kept as candidates for
support vectors, while the others are deleted. In [121], incremental SVM is developed in the
primal space. When a sample is added, the method applies the efficient Cholesky 
decomposition technique. 

Online independent SVM [148] approximately converges to the SVM solu- 
tion each time new observations are added; the approximation is controlled via a 
user-defined parameter. The method employs a set of linearly independent obser- 
vations and tries to project every new observation onto the set obtained so far, 
dramatically reducing time and space requirements at the price of a negligible 
loss in accuracy. Online independent SVM produces a smaller model compared 
to that by standard SVM, with a training complexity of asymptotically O(N7). 
It uses the L2-norm of the slack variables.

Like relevance vector machine (RVM) [200], an incremental method for supervised
learning given in [206] learns the parameters of the kernels during training.
Specifically, different parameter values are learned for each kernel, resulting in 
a very flexible model. A sparsity-enforcing prior is used to control the effective 
number of model parameters. 


SVMs for active, transductive and semi-supervised learning 


SVMs for active learning 


An active learning algorithm using SVM identifies positive examples in a data 
set [220]. The next point to be labeled is selected in the algorithm
using two heuristics that can be derived from an SVM classifier trained on points 
with known labels. The largest positive heuristic selects the point that has the 
largest classification score among all examples still unlabeled. The near boundary 
heuristic selects the point whose classification score has the smallest absolute 
value. In both cases SVM has to be retrained after each selection. A better way 
is to apply incremental learning. 
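
The two selection heuristics can be sketched as follows, assuming the
classification scores of an already trained SVM are available as a plain list.
The function names, the scores and the index sets are hypothetical.

```python
def largest_positive(scores, unlabeled):
    """Index of the unlabeled point with the largest classification score."""
    return max(unlabeled, key=lambda i: scores[i])


def near_boundary(scores, unlabeled):
    """Index of the unlabeled point whose score has the smallest absolute value."""
    return min(unlabeled, key=lambda i: abs(scores[i]))


# Hypothetical SVM decision values for six points; points 0 and 3 are
# assumed to be labeled already, so they are excluded from the pool.
scores = [1.7, -0.2, 0.9, -1.4, 0.05, 2.3]
unlabeled = [1, 2, 4, 5]

next_lp = largest_positive(scores, unlabeled)
next_nb = near_boundary(scores, unlabeled)
```

After the chosen point is labeled, it is removed from `unlabeled` and the
scores are recomputed, either by retraining or by an incremental update.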

Given a set of labeled training data and a Mercer kernel, there is a set of 
hyperplanes that separate the data in the induced feature space. This set of
consistent hypotheses is called the version space. In pool-based active learning
[202], the learner has access to a pool of unlabeled instances and can request the
labels for some of them. An algorithm for performing active learning with SVM
chooses which instances should be requested next. The method significantly
reduces the need for labeled training instances in both standard inductive and 
transductive settings. 


SVMs for transductive or semi-supervised learning 


In addition to regular induction, SVM can also be used for transduction. SVM 
can perform transduction by finding the hyperplane that maximizes the margin 
relative to both the labeled and unlabeled data. See Fig. 16.10 for an example. 
Transductive SVM has been used for text classification, attaining improvements 
in precision/recall breakeven performance over regular inductive SVM [202]. 
Transduction utilizes the prior knowledge of the unlabeled test patterns [208]. 
It is an essentially easier task than first learning a general inductive rule and 
then applying it to the test examples. Transductive bounds address the perfor- 
mance of the trained system on these test patterns only. When the test and 
training data are not identically distributed, the concept of transduction could 
be particularly worthwhile. Transductive SVM solves the SVM problem while 
treating the unknown labels as additional optimization variables. By maximiz- 
ing the margin in the presence of unlabeled data, one learns a decision boundary 
that traverses through low data-density regions while respecting labels in the
input space. In other words, this approach implements the cluster assumption
for semi-supervised learning [25]. Transductive SVM learns an inductive rule
defined over the entire input space.

Figure 16.10 SVM (solid line) and transductive SVM (dotted line). + signs represent
unlabeled instances.

Transductive SVM [209] is a method of improving the generalization accuracy 
of SVM by using unlabeled data. It learns a large margin hyperplane classifier 
using labeled training data, but simultaneously forces this hyperplane to be far 
away from the unlabeled data. Transduction (labeling a test set) is inherently 
easier than induction (learning a general rule) [209]. Transductive SVM can 
provide considerable improvement in generalization over SVM, if the number of 
labeled points is small and the number of unlabeled points is large. 

The original transductive SVM problem is described as follows. The training
set consists of $L$ labeled examples $\mathcal{L} = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{L}$,
$y_i = \pm 1$, and $U$ unlabeled examples $\mathcal{U} = \{\boldsymbol{x}_i\}_{i=L+1}^{N}$,
with $N = L + U$. Find among the possible binary vectors
$\mathcal{Y} = \{(y_{L+1}, \ldots, y_{L+U})\}$ the one such that an SVM trained on
$\mathcal{L} \cup (\mathcal{U} \times \mathcal{Y})$ yields the largest margin. This is a
combinatorial problem, but one can approximate it as finding an SVM separating
the training set under constraints which force the unlabeled examples to be as
far as possible from the margin [209]. A primal method [25] scales as $(L + U)^3$,
and stores the entire $(L + U) \times (L + U)$ kernel matrix in memory.

Following a formulation using an integer programming method, transductive
linear SVM with an L1-norm regularizer [27], where the corresponding loss function
is decomposed as a sum of a linear function and a concave function, is
algorithmically close to CS$^3$VM [65]. SVMlight-TSVM [91] is a combinatorial
approach that is practical for a few thousand examples. $\nabla$TSVM [25] is optimized
by performing gradient descent in the primal space.

One problem with transductive SVM is that in high dimensions with few train- 
ing examples, it is possible to classify all the unlabeled examples as belonging 
to only one of the classes with a very large margin, which leads to poor per- 
formance. One can constrain the solution by introducing a balancing constraint 
that ensures the unlabeled data are assigned to both classes. The fraction of 
positive and negatives assigned to the unlabeled data can be assumed to be the 
same fraction as found in the labeled data [91].

Transductive SVM learning provides a decision boundary in the entire input
space and can be considered as inductive, rather than strictly transductive, 
semi-supervised learning. Semi-supervised SVM is based on applying the margin 
maximization principle to both labeled and unlabeled examples. Thus, semi- 
supervised SVMs are inductive semi-supervised methods and not strictly trans- 
ductive [27]. Many techniques are available for solving the non-convex optimiza- 
tion problem associated with semi-supervised SVM or transductive SVM, for 
example, local combinatorial search [91], gradient descent [25], convex-concave 
procedures [65], [41], and SDP [228]. 

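In linear semi-supervised SVM, the minimization problem is solved over both
the hyperplane parameters $(\boldsymbol{w}, b)$ and the label vector
$\boldsymbol{y}_U = (y_{L+1}, \ldots, y_N)^T$:

$$\min_{(\boldsymbol{w}, b),\, \boldsymbol{y}_U} I(\boldsymbol{w}, b, \boldsymbol{y}_U) = \frac{1}{2} \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{L} V(y_i, o_i) + C^* \sum_{i=L+1}^{N} V(y_i, o_i), \qquad (16.49)$$

where $o_i = \boldsymbol{w}^T \boldsymbol{x}_i + b$ and the loss function $V$ is usually
selected as the hinge loss,

$$V(y_i, o_i) = [\max(0, 1 - y_i o_i)]^p. \qquad (16.50)$$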


It is common to select either p = 1 or p = 2. Nonlinear decision boundaries can 
be constructed using the kernel trick. 

The first two terms in the objective function (16.49) define SVM. The third 
term incorporates unlabeled data. The loss over labeled and unlabeled examples 
is weighted by two hyperparameters, C and C*, which reflect confidence in the 
labels and in the cluster assumption, respectively. The problem (16.49) is solved 
under the class-balancing constraint [91]: 


$$\frac{1}{U} \sum_{i=L+1}^{N} \max(y_i, 0) = r, \quad \text{or equivalently,} \quad \frac{1}{U} \sum_{i=L+1}^{N} y_i = 2r - 1. \qquad (16.51)$$
This constraint helps in avoiding unbalanced solutions by enforcing that a cer- 
tain user-specified fraction, r, of the unlabeled data should be assigned to the 
positive class. r is estimated from the class ratio on the labeled set, or from prior 
knowledge of the classification problem. 
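
To make the roles of the three terms in (16.49) and of the constraint (16.51)
concrete, the following sketch evaluates the objective for a fixed hyperplane
and two candidate labelings of the unlabeled points. It only evaluates the
objective and does not solve the optimization problem; the function names,
parameter values and toy data are assumptions made for illustration.

```python
def hinge(y, o, p=1):
    """Hinge loss V(y, o) = max(0, 1 - y*o)**p, cf. (16.50)."""
    return max(0.0, 1.0 - y * o) ** p


def s3vm_objective(w, b, labeled, unlabeled, y_unl, C=1.0, C_star=0.5, p=1):
    """Evaluate the objective (16.49) for a fixed candidate labeling y_unl."""
    def output(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    reg = 0.5 * sum(wi * wi for wi in w)
    lab_loss = sum(hinge(y, output(x), p) for x, y in labeled)
    unl_loss = sum(hinge(y, output(x), p) for x, y in zip(unlabeled, y_unl))
    return reg + C * lab_loss + C_star * unl_loss


def positive_fraction(y_unl):
    """Left-hand side of the balancing constraint (16.51)."""
    return sum(max(y, 0) for y in y_unl) / len(y_unl)


# Hypothetical hyperplane and data: two labeled and two unlabeled points.
w, b = (1.0, 0.0), 0.0
labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(0.5, 1.0), (-0.5, 1.0)]

good = s3vm_objective(w, b, labeled, unlabeled, [1, -1])
bad = s3vm_objective(w, b, labeled, unlabeled, [-1, 1])
r = positive_fraction([1, -1])
```

Both labelings satisfy the balancing constraint with the same `r`, but the one
that agrees with the side of the hyperplane on which each unlabeled point lies
gives the smaller objective value.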

CCCP is applied to semi-supervised SVM in [65], [41]. Using CCCP, a large- 
scale training method solves a series of SVM optimization problems with L + 
2U variables. It involves iterative solving of standard dual QP problems, and 
usually requires just a few iterations. This provides a highly scalable algorithm 
in the nonlinear case. Successive convex optimization is performed using an SMO 
implementation. CCCP-TSVM runs orders of magnitude faster than SVMlight-TSVM
and $\nabla$TSVM [41]. Both SVMlight-TSVM and $\nabla$TSVM use an annealing
heuristic for hyperparameter $C^*$.

Figure 16.11 Illustration of the transductive accuracy in a two-dimensional input space. (a) Boundary
found by inductive learning with SVM classifier. Only labeled points (+ and o) are used for training.
(b) Boundary found by transductive learning. All points, labeled and testing (unlabeled), are used for
training. ©IEEE, 2009 [1].

An inductive semi-supervised learning method proposed in [115] first builds
a trained Gaussian kernel support function that estimates a support of a data
distribution via an SVDD procedure using both labeled and unlabeled data.
Then, it partitions the whole data space into separate clustered regions. Finally,
it classifies the decomposed regions utilizing the information of the labeled data
and the topological structure of the clusters described by the constructed support
function. Its formulation leads to a non-convex optimization problem.
S$^3$VMlight [91], the semi-supervised SVM algorithm implemented in SVMlight,
is based on local combinatorial search guided by a label-switching procedure.
$\nabla$S$^3$VM [25] directly minimizes the objective function by gradient
descent, with a complexity of $O(N^3)$. In comparison, S$^3$VMlight typically scales
as $O(n^3 + N^2)$ for $n$ support vectors. A loss function for unlabeled points and
an associated Newton semi-supervised SVM method are proposed in [27] (along
the lines of [26]), bringing down the complexity of $\nabla$S$^3$VM. Semi-supervised
LS-SVM classifier [1] uses the transductive inference formulation, and different
approaches are deduced from the transductive SVM idea to train the classifier.


Example 16.8: The use of inductive inference for estimating the value of a func- 
tion at given points involves two steps. The training points are first used to 
estimate a function for the entire input space, and the values of that function on 
separate test points are then computed based on the estimated parameters. In 
transductive inference, the end goal is to determine the values of the function at 
the predetermined test points. Figure 16.11 gives the classification boundaries 
of inductive and transductive learning with SVM classifier. Good classification 
is obtained on the testing samples for transductive learning.


Probabilistic approach to SVM


In Bayesian techniques for support vector classification [36], Bayesian inference 
is used to implement model adaptation, while keeping the merits of support 
vector classifier, such as sparseness and convex programming. A differentiable 
loss function called trigonometric loss function has the desirable characteristic 
of natural normalization in the likelihood function. 

Posterior probability SVM [192] modifies SVM to utilize class probabilities
instead of hard −1/+1 labels. It uses soft labels derived from estimated
posterior probabilities so as to be more robust to noise and outliers. The method
uses a window-based density estimator for the posterior probabilities. It achieves
an accuracy similar to that of the standard SVM while storing fewer support vectors.
The decrease in error by posterior probability SVM is due to a decrease in bias
rather than variance. In [71], a neighbor-based density estimator is proposed,
and the method is extended to the multiclass case.


Relevance vector machines 


The outputs of SVM are not probabilistic predictions. Ideally, one prefers to
estimate the conditional distribution $p(y|\boldsymbol{x})$ in order to capture
uncertainty. In SVM, it is also necessary to estimate the error/margin tradeoff
parameter $C$ (as well as the insensitivity parameter $\epsilon$ in regression); this
generally entails a cross-validation procedure. RVM [201] is a Bayesian treatment
of the SVM prediction which does
not suffer from any of these limitations. RVM and informative vector machine
[111] are sparse probabilistic kernel classifiers in a Bayesian setting. These non- 
SVM models can match the accuracy of SVM, while also bringing down consid- 
erably the number of kernel functions as well as the training cost. 

The basic idea of RVM is to assume a prior of the expansion coefficients which 
favors sparse solutions. Sparsity is obtained because the posterior distributions of 
many of the weights become sharply peaked around zero. Those training vectors 
associated with the remaining nonzero weights are termed relevance vectors. Each 
weight is assumed to be a Gaussian variable. RVM acquires relevance vectors and 
weights by maximizing a marginal likelihood. Through a learning process, the 
data points with very small variances of the weights will be discarded. The most 
contributing kernels are finally selected for a model. Derived by exploiting a 
probabilistic Bayesian learning framework, accurate prediction models typically 
utilize dramatically fewer kernel functions than a comparable SVM does while 
offering a number of additional advantages: probabilistic predictions, automatic 
estimation of parameters, and the facility to utilize arbitrary kernel functions 
(e.g. non-Mercer kernels). 
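
The sparsity mechanism can be made explicit with the usual form of the RVM
model and prior. The notation below is assumed for illustration: $w_i$ are the
expansion coefficients, and each $\alpha_i$ is a separate precision hyperparameter
(unrelated to the SVM dual variables).

$$y(\boldsymbol{x}) = \sum_{i=1}^{N} w_i k(\boldsymbol{x}, \boldsymbol{x}_i) + w_0, \qquad p(\boldsymbol{w} \,|\, \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \,|\, 0, \alpha_i^{-1}).$$

Maximizing the marginal likelihood with respect to the $\alpha_i$ drives many of
them to very large values, so the corresponding prior variances $\alpha_i^{-1}$
shrink toward zero, the weights are pinned at zero, and the associated kernels
are pruned from the model.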

The primary disadvantage of the sparse Bayesian method is the computational
complexity of the learning algorithm. Although the presented update rules are
very simple in form, they require $O(N^2)$ memory and $O(N^3)$ computation for
$N$ training examples. This implies that RVM becomes less practical for large
training sets [200].

Analysis of RVM shows that adopting the same prior for different classes may 
lead to unstable solutions [29]. In order to tackle this problem, a signed and 
truncated Gaussian prior is adopted over every weight in probabilistic classifica- 
tion vector machine (PCVM) [29], where the sign of the prior is determined by 
the class label, i.e., +1 or −1. The truncated Gaussian prior not only restricts
the sign of the weights but also leads to a sparse estimation of the weight vectors.
PCVM outperforms soft-margin SVM, hard-margin SVM, RVM, and SVM
with kernel parameters optimized by PCVM (SVM$_{\mathrm{PCVM}}$). SVM$_{\mathrm{PCVM}}$ performs
slightly better than soft-margin SVM. The superiority of PCVM formulation is
also discussed using MAP analysis and margin analysis in [29]. 


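16.1 Acceptable kernels must satisfy Mercer’s condition. Let $k_1$ and $k_2$ be
kernels defined on $R^d \times R^d$. Show that the following are also kernels:

(a) $k(\boldsymbol{x}, \boldsymbol{z}) = a k_1(\boldsymbol{x}, \boldsymbol{z}) + b k_2(\boldsymbol{x}, \boldsymbol{z})$, with $a, b \in R^+$.

(b) $k(\boldsymbol{x}, \boldsymbol{z}) = g(\boldsymbol{x}) g(\boldsymbol{z})$, with $g(\cdot)$ being a real-valued function.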


16.2 For Example 16.1, compute the kernel matrix $\boldsymbol{K} = [k_{ij}]$. Verify its
positive definiteness.


16.3 Consider the mapping which maps two-dimensional points $(p, q)$ to
three-dimensional points $(p^2, q^2, \sqrt{2} pq)$. Specify the dot product of two vectors
in three-dimensional space in terms of the dot products of the corresponding points in
two-dimensional space.


16.4 For Example 16.1, establish the QP problem, given by (16.12) to (16.14),
for SVM learning. Solve for $\alpha_p$, $p = 1, \ldots, 4$.


16.5 Derive the dual optimization problem for a soft-margin SVM.


Other kernel methods


17.1 Introduction


The kernel method was originally invented in [3]. The key idea is to project the 
training set in a lower-dimensional space into a high-dimensional kernel (feature) 
space by means of a set of nonlinear kernel functions. As stated by Cover's
theorem, the data are more likely to be linearly separable when they are
nonlinearly mapped to a higher-dimensional space. The kernel method is a powerful
nonparametric modeling tool in machine learning and data analysis. Well-known 
examples include kernel density estimator (also called the Parzen window estima- 
tor) as well as the RBF network and SVM. Kernel-based methods have played 
an important role in many fields, such as pattern recognition, approximation, 
modeling and data mining. 

The kernel method generates algorithms that, by replacing the inner product 
with an appropriate positive definite function, implicitly perform a nonlinear 
mapping of the input data into a high-dimensional feature space. Introduced 
with SVM, the kernel trick has attracted much attention because of its efficient 
and elegant way of modeling nonlinear patterns. The kernel trick has been applied 
to construct nonlinear equivalents of a wide range of classical linear statistical 
models. An important advantage of kernel models is that the parameters of the 
model are typically given by the solution of a convex optimization problem, with 
a single, global optimum. 

Reproducing kernel Hilbert spaces (RKHSs), defined by Aronszajn in 1950 
[7], are now commonly used as hypothesis spaces in learning theory. Combined 
with regularization techniques, often they allow good generalization capabilities 
of the learned models and enforce desired smoothness properties of the solutions 
to the learning problems. 

A function $k: X \times X \to R$ is called a kernel on $X$ if there exists a Hilbert
space $H$, known as a feature space of $k$, and a map $\phi: X \to H$, known as a
feature map of $k$, with

$$k(x, y) = \langle \phi(x), \phi(y) \rangle, \quad \forall x, y \in X. \qquad (17.1)$$

Notice that both $H$ and $\phi$ are far from being unique. However, for a given kernel
there exists a unique RKHS. In kernel methods, data are represented as functions or
elements in RKHSs, which are associated with positive definite kernels. Equation
(17.1) is commonly referred to as the kernel trick. For complex-valued signal
processing, one can map the input data into a complex RKHS using pure complex
kernels or real kernels (via the complexification trick) [11, 12].
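
Equation (17.1) can be checked numerically for a kernel whose feature map is
known in closed form. The sketch below uses the inhomogeneous polynomial kernel
of degree 2 on two-dimensional inputs, chosen here purely for illustration; the
function names are hypothetical.

```python
import math


def poly2_kernel(x, y):
    """Polynomial kernel k(x, y) = (x.y + 1)**2 on two-dimensional inputs."""
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** 2


def poly2_feature_map(x):
    """Explicit six-dimensional feature map matching poly2_kernel."""
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1])


x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly2_kernel(x, y)
rhs = sum(a * b for a, b in zip(poly2_feature_map(x), poly2_feature_map(y)))
```

The kernel evaluation on the left needs only one two-dimensional inner product,
whereas the right-hand side works in the six-dimensional feature space; both
give the same value.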

A function $f: X \to R$ is said to be induced by kernel $k$ if there exists an
element $w \in H$ such that $f = \langle w, \phi(\cdot) \rangle$. A continuous kernel $k$ on a
compact metric space $(X, d)$ is called universal if the space of all functions
induced by $k$ is dense in $C(X)$, i.e., for every function $f \in C(X)$ and every
$\epsilon > 0$ there exists a function $g$ induced by $k$ with $\|f - g\| < \epsilon$. Every
universal kernel separates all compact subsets.

The representer theorem can be stated as follows. 


Theorem 17.1 (Representer theorem). Any function defined in an RKHS 
can be represented as a linear combination of Mercer kernel functions. 


Another way for nonlinear feature generation is the kernels-as-features idea
[9], where the kernel function is directly considered as features. Given a kernel
function $k(\cdot, \cdot)$ and $l$ data $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_l\}$ of the input space $X$, we can
map each $\boldsymbol{x} \in X$ into an $l$-dimensional kernel feature space, called $\gamma$-space, by
defining $\boldsymbol{\gamma}(\boldsymbol{x}) = (k(\boldsymbol{x}, \boldsymbol{x}_1), \ldots, k(\boldsymbol{x}, \boldsymbol{x}_l))^T$, and then certain algorithms
can be performed in the $\gamma$-space instead of the input space to deal with nonlinearity.
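
The kernels-as-features mapping is simple enough to write down directly. The
sketch below uses a Gaussian kernel and three anchor points as assumptions; the
names `empirical_kernel_map` and `anchors` are illustrative only.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


def empirical_kernel_map(x, anchors, kernel=gaussian_kernel):
    """Map x to the explicit vector (k(x, x_1), ..., k(x, x_l))."""
    return [kernel(x, a) for a in anchors]


anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]    # l = 3 data points
features = empirical_kernel_map((0.0, 0.0), anchors)
```

Any linear algorithm can then be run on these explicit `l`-dimensional vectors,
in contrast to the kernel trick, where the feature space is never constructed.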

Both the kernel trick and kernels-as-features ideas produce nonlinear feature 
spaces to perform certain linear algorithms for dealing with nonlinearity. How- 
ever, the feature spaces produced by the two ideas are different: the former is 
implicit and can only be accessed by the kernel function as a black-box program 
of inner product, whereas the latter is explicitly constructed using a set of data 
and a kernel function. An exact equivalence between the two kernel ideas applied 
to PCA and LDA is established in [72]. There is an equivalence up to different 
scalings on each feature between the kernel trick and kernels-as-features ideas 
applied to certain feature extraction algorithms, i.e., LDA, PCA and CCA [120]. 

The notion of refinable kernels [116] leads to the introduction of wavelet-like 
reproducing kernels, yielding multiresolution analysis of RKHSs. Refinable ker- 
nels provide computational advantages for solving various learning problems. 
The dominant set of eigenvectors of the symmetrical kernel Gram matrix is used 
in many important kernel methods in machine learning. An efficient incremental 
approach is presented in [47] for fast calculation of the dominant kernel eigenba- 
sis. 

After the success of SVM, many linear learning methods have been formulated 
using kernels, producing the other kernel-based methods: kernel PCA [93], [100], 
kernel LDA [76], [119], kernel clustering [39], kernel BSS [75], kernel ICA [8], 
kernel CCA [60], and the minimax probability machine [61]. 

Kernel methods are extended from RKHSs to Krein spaces [70] and Banach
spaces [97]. A class of reproducing kernel Banach spaces with the L1 norm that
satisfies the linear representer theorem can be applied in machine learning [97].


Kernel PCA


Kernel PCA [93] introduces kernel functions into PCA. It first maps the orig- 
inal input data into a high-dimensional feature space using the kernel method 
and then calculates PCA in the high-dimensional feature space. Linear PCA in 
the high-dimensional feature space corresponds to a nonlinear PCA in the orig- 
inal input space. The decomposition of a Gram matrix is a particularly elegant 
method for extracting nonlinear features from multivariate data. 

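Given an input pattern set $\{\boldsymbol{x}_i \in R^{J_1} \,|\, i = 1, \ldots, N\}$, $\boldsymbol{\phi}: R^{J_1} \to R^{J_2}$ is a
nonlinear map from the $J_1$-dimensional input to the $J_2$-dimensional feature space.
A $J_2$-by-$J_2$ correlation matrix in the feature space is defined by

$$\boldsymbol{C}_1 = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\phi}(\boldsymbol{x}_i) \boldsymbol{\phi}^T(\boldsymbol{x}_i). \qquad (17.2)$$

Like PCA, the set of feature vectors is constrained to zero-mean,
$\frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\phi}(\boldsymbol{x}_i) = \boldsymbol{0}$. A procedure for selecting $\boldsymbol{\phi}$ is given in [92].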

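The principal components are then computed by solving the eigenvalue problem
[93, 77]

$$\lambda \boldsymbol{v} = \boldsymbol{C}_1 \boldsymbol{v} = \frac{1}{N} \sum_{i=1}^{N} \left( \boldsymbol{\phi}^T(\boldsymbol{x}_i) \boldsymbol{v} \right) \boldsymbol{\phi}(\boldsymbol{x}_i). \qquad (17.3)$$

Thus, $\boldsymbol{v}$ must be in the span of the mapped data

$$\boldsymbol{v} = \sum_{i=1}^{N} \alpha_i \boldsymbol{\phi}(\boldsymbol{x}_i). \qquad (17.4)$$

After substituting (17.4) into (17.3), premultiplying both sides by $\boldsymbol{\phi}^T(\boldsymbol{x}_j)$
and performing mathematical manipulations, the kernel PCA problem reduces to

$$\boldsymbol{K} \boldsymbol{\alpha} = \lambda \boldsymbol{\alpha}, \qquad (17.5)$$

where the factor $N$ has been absorbed into $\lambda$, so that $\lambda$ and
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_N)^T$ are, respectively, the eigenvalues and the
corresponding eigenvectors of $\boldsymbol{K}$, and $\boldsymbol{K}$ is an $N \times N$ kernel matrix with

$$K_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \boldsymbol{\phi}^T(\boldsymbol{x}_i) \boldsymbol{\phi}(\boldsymbol{x}_j). \qquad (17.6)$$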
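
The kernel matrix of (17.6) and the zero-mean constraint on the mapped data can
be sketched in a few lines. The centering step below is the standard
double-centering of the Gram matrix, which is an assumption brought in here
rather than a formula from the text; the function names and the toy data are
illustrative.

```python
def gram_matrix(data, kernel):
    """Kernel matrix with entries K[i][j] = k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in data] for xi in data]


def center_gram(K):
    """Gram matrix of the mapped data after removing their feature-space mean."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total for j in range(n)]
            for i in range(n)]


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


# With a linear kernel the result can be checked by hand: centering K is the
# same as subtracting the mean (1, 1) from the data and recomputing K.
data = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
K = gram_matrix(data, linear_kernel)
Kc = center_gram(K)
```

Every row and column of the centered matrix sums to zero, which reflects the
zero-mean constraint in feature space without ever forming the mapped vectors.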


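Popular kernel functions used in the kernel method are the polynomial, Gaussian
and sigmoidal kernels, which are respectively given by [77]

$$k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( \boldsymbol{x}_i^T \boldsymbol{x}_j + \theta \right)^m, \qquad (17.7)$$

$$k(\boldsymbol{x}_i, \boldsymbol{x}_j) = e^{-\frac{\|\boldsymbol{x}_i - \boldsymbol{x}_j\|^2}{2\sigma^2}}, \qquad (17.8)$$

$$k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \tanh\left( a \boldsymbol{x}_i^T \boldsymbol{x}_j + \theta \right), \qquad (17.9)$$

where $m$ is a positive integer, $\sigma > 0$, and $a, \theta \in R$. Even if the exact form of
$\boldsymbol{\phi}(\cdot)$ is not known explicitly, any symmetric function $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$ satisfying
Mercer’s theorem can be used as a kernel function [77].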
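
For reference, the three kernels (17.7) to (17.9) can be written as short
functions. The default parameter values and the test points are arbitrary
choices made for this sketch.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, m=2, theta=1.0):
    """Polynomial kernel (x.z + theta)**m, cf. (17.7)."""
    return (dot(x, z) + theta) ** m


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)), cf. (17.8)."""
    diff = [a - b for a, b in zip(x, z)]
    return math.exp(-dot(diff, diff) / (2.0 * sigma ** 2))


def sigmoidal_kernel(x, z, a=1.0, theta=0.0):
    """Sigmoidal kernel tanh(a x.z + theta), cf. (17.9)."""
    return math.tanh(a * dot(x, z) + theta)


x, z = (1.0, 2.0), (0.5, -1.0)
values = (polynomial_kernel(x, z), gaussian_kernel(x, z), sigmoidal_kernel(x, z))
```

Note that the sigmoidal kernel is not positive semidefinite for every choice of
`a` and `theta`, so Mercer's condition has to be checked for the parameters in
use.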
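Arrange the eigenvalues in descending order, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{J_2} > 0$, and
denote their corresponding eigenvectors as $\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{J_2}$. The eigenvectors are
further normalized as

$$\boldsymbol{\alpha}_k^T \boldsymbol{\alpha}_k = \frac{1}{\lambda_k}. \qquad (17.10)$$

The nonlinear principal components of $\boldsymbol{x}$ can be extracted by projecting the
mapped pattern $\boldsymbol{\phi}(\boldsymbol{x})$ onto $\boldsymbol{v}_k$ [93, 77]

$$\boldsymbol{v}_k^T \boldsymbol{\phi}(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{k,j} k(\boldsymbol{x}_j, \boldsymbol{x}), \quad k = 1, 2, \ldots, J_2, \qquad (17.11)$$

where $\alpha_{k,j}$ is the $j$th element of $\boldsymbol{\alpha}_k$.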
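
The normalization (17.10) and the projection (17.11) can be traced on a tiny
example. The sketch below is only illustrative: it finds the leading
eigenvector by power iteration instead of a full eigendecomposition, uses a
linear kernel so that the result can be compared with ordinary PCA, and assumes
zero-mean toy data. All names are hypothetical.

```python
import math


def leading_eigenpair(K, iters=500):
    """Power iteration for the largest eigenpair of a symmetric PSD matrix."""
    n = len(K)
    v = [float(i + 1) for i in range(n)]           # arbitrary nonzero start
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(a * b for a, b in zip(v, Kv))        # Rayleigh quotient, |v| = 1
    return lam, v


def kpca_component(x, data, alpha, kernel):
    """Projection of the mapped pattern onto v_k, cf. (17.11)."""
    return sum(a * kernel(xj, x) for a, xj in zip(alpha, data))


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


# Toy data with zero mean, so the linear-kernel Gram matrix is already centered.
data = [(0.0, -1.0), (-1.0, 0.0), (1.0, 1.0)]
K = [[linear_kernel(a, b) for b in data] for a in data]

lam, unit_vec = leading_eigenpair(K)
alpha = [c / math.sqrt(lam) for c in unit_vec]     # alpha^T alpha = 1/lam, cf. (17.10)
projection = kpca_component((1.0, 1.0), data, alpha, linear_kernel)
```

For this data the leading eigenvalue of the kernel matrix is 3, and the
magnitude of the projection of the point (1, 1) equals its projection onto the
first ordinary principal direction, as expected for a linear kernel. The sign of
the projection depends on the sign of the eigenvector returned.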

Compared with PCA, kernel PCA is much more complicated and may sometimes be
caught more easily in local minima. It has two further drawbacks. First, the
kernel PCA method via eigendecomposition involves a time complexity of $O(N^3)$,
with $N$ being the number of training vectors. Second, the resulting kernel
principal components have to be defined implicitly by linear expansions of the
training data; thus, all data must be saved after training.

PCA needs to deal with an eigenvalue problem of a $J_1 \times J_1$ matrix, while
kernel PCA needs to solve an eigenvalue problem of an N x N matrix. Sparse 
approximation methods can be applied to reduce the computational cost [77]. 
An algorithm proposed in [13] enables us to recover the number of leading kernel 
PCA components relevant for good classification. 

Traditionally, kernel methods require computation and storage of the entire 
kernel matrix and preparation of all training samples beforehand. This require- 
ment can be eliminated by repeatedly cycling through the data set, computing 
kernels on demand, as implemented in decomposition methods for SVM. This 
is done for kernel PCA by the kernel Hebbian algorithm as an online version 
of kernel PCA [53], which suffers from slow convergence. The kernel Hebbian 
algorithm, introduced by kernelizing GHA, has a scalar gain parameter that is 
either held constant or decreased according to a predetermined annealing sched- 
ule. Gain adaptation can improve convergence of the kernel Hebbian algorithm 
by incorporating the reciprocal of the current estimated eigenvalues [42]. Sub- 
set kernel PCA uses a subset of samples for calculating the bases of nonlinear 
principal components. Online subset kernel PCA such as subset kernel Hebbian 
algorithm gradually adds and exchanges a sample in the basis set, thus it can be 
applied to time-varying patterns [111]. 

For kernel PCA, the L2 loss function used is not robust, and outliers can skew
the solution from the desired one. Kernel PCA also lacks sparseness because the
principal components are expressed in terms of a dense expansion of kernels
associated with every training data point. Some approaches introduce sparseness into
kernel PCA, e.g., [94], [96]. By introducing the LS-SVM formulation to kernel PCA,
kernel PCA is extended to a generalized form of kernel component analysis with
a general underlying loss function made explicit [5]. Robustness and sparseness
are introduced into kernel component analysis by using an $\epsilon$-insensitive robust
loss function [5]. A robust kernel PCA method [45] extends kernel PCA and
uses fuzzy memberships to tackle the two problems simultaneously. Incremental
kernel PCA in Krein space does not require the calculation of preimages and
therefore is both efficient and exact [70].

The adaptive kernel PCA method in [31] has the flexibility to accurately 
track the kernel principal components. First, the kernel principal components are
recursively formulated from the recursive eigen-decomposition of the kernel
covariance matrix. The kernel covariance matrix is then correctly updated to adapt to
the changing characteristics of data. In this adaptive method, the kernel princi- 
pal components are adaptively adjusted without re-eigendecomposing the kernel 
Gram matrix. The method not only maintains constant update speed and mem- 
ory usage as the data-size increases, but also alleviates sub-optimality of the 
kernel PCA method for non-stationary data. 

An LS-SVM approach to kernel PCA is given in [100]. A weighted kernel PCA 
formulation based on the LS-SVM framework [5] has been used for extending 
kernel PCA to general loss functions in order to achieve sparseness and robustness 
in an efficient way [5]. 

An incremental kernel SVD algorithm based on kernelizing incremental linear 
SVD is given in [24]. It does not require adaptive centering of the incremental 
data and the appropriate adjustment of the factorized subspace bases. Kernel 
PCA and kernel SVD return vastly different results if the data set is not centered. 
In [24], reduced set expansions are constructed to compress the kernel SVD 
basis so as to achieve constant incremental update speed. By using a better 
compression strategy and adding a kernel subspace re-orthogonalization scheme, 
an incremental kernel PCA [25] has linear time complexity to maintain constant 
update speed and memory usage. 

A kernel projection pursuit method [96] chooses sparse directions and orthog- 
onalizes in the same way as PCA. Utilizing a similar methodology, a general 
framework for feature extraction is formalized based on an orthogonalization 
procedure, and two sparse kernel feature extraction methods are derived in [28]. 
These approaches have a training time O(N). 

A class of robust procedures for kernel PCA is proposed in [48] based on EVD 
of weighted covariance. The procedures place less weight on deviant patterns and
thus are more resistant to data contamination and model deviation.

Kernel entropy component analysis [51] for data transformation and dimension 
reduction reveals structure relating to the Renyi entropy of the input space data 
set, estimated via a kernel matrix using Parzen windowing. This is achieved 
by projections onto a subset of entropy-preserving kernel PCA axes. In [101], a 
kernel-based method for data visualization and dimension reduction is proposed 
in the framework of LS-SVMs. 
Figure 17.1 The first three principal components for the two-dimensional data set. ©IEEE, 2005 [53].


Example 17.1: We replicate this example from [53]. For a two-dimensional data
set, 150 data points are generated from $y_i = -x_i^2 + \epsilon$, where $x_i$ is generated
from a uniform distribution in $[-1, 1]$ and $\epsilon$ is normal noise with standard deviation
0.2. Contour lines of constant value of the first three principal components for 
the data set are obtained from kernel PCA with degree-2 polynomial kernel, as 
shown in Fig. 17.1. 
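
The computation behind Example 17.1 can be sketched as follows. This is a minimal illustration, not the code used in [53]: the inhomogeneous form $(x^T y + 1)^2$ of the degree-2 polynomial kernel, the random seed, and the helper names `kernel_pca` and `poly2` are assumptions made here. The sketch centers the Gram matrix in feature space, solves the $N \times N$ eigenvalue problem, and projects the training points onto the first three kernel principal components.

```python
import numpy as np

def poly2(A, B):
    # Degree-2 polynomial kernel (inhomogeneous form, assumed for illustration).
    return (A @ B.T + 1.0) ** 2

def kernel_pca(X, kernel, n_components):
    """Return leading eigenvalues, expansion coefficients, and the centered Gram matrix."""
    N = X.shape[0]
    K = kernel(X, X)
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    Kc = H @ K @ H                           # Gram matrix of centered feature vectors
    eigvals, eigvecs = np.linalg.eigh(Kc)    # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam = eigvals[idx]
    # Scale so that each principal axis has unit norm in feature space.
    alpha = eigvecs[:, idx] / np.sqrt(lam)
    return lam, alpha, Kc

# Synthetic data following the description of Example 17.1.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 150)
y = -x**2 + rng.normal(0.0, 0.2, 150)
X = np.column_stack([x, y])

lam, alpha, Kc = kernel_pca(X, poly2, n_components=3)
scores = Kc @ alpha   # projections of the training points onto the 3 components
```

Contour plots such as those in Fig. 17.1 would be obtained by evaluating the same projections on a grid of test points, which requires centering the test kernel columns consistently with the training data.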


Example 17.2: Image denoising. A noisy image can be denoised by using ker- 
nel PCA. A large image is regarded as a composition of multiple patches. The 
multiple patches are used as training samples for learning the few leading
eigenvectors by kernel PCA. The image is reconstructed by reconstructing each of the
component patches and the noise is thus removed. This approach is similar to 
wavelet-based methods for image denoising in the sense that the objective is to 
find a good feature space in which the noise shows low power or is concentrated 
on a small subspace. 

We replicate this example from [53]. Two different noisy images were con- 
structed by adding white Gaussian noise (SNR 7.72 dB) and salt-and-pepper 
noise (SNR 4.94 dB) to the 256 x 256-sized Lena image. From each image, 
12 x 12 overlapping image patches were sampled at a regular interval of two 
pixels. The kernel PCA model for the Gaussian kernel ($\sigma = 1$) was obtained by
training the kernel Hebbian algorithm on each training set with a learning rate
$\eta = 0.05$ for around 800 epochs through the data set. The denoised images were
then obtained by reconstructing the input image using the first r principal com- 
ponents from each kernel PCA model. The kernel Hebbian algorithm performs 
well for both noise types. Similar to linear PCA, kernel PCA is expected to work 
best if the noise characteristics are Gaussian in the feature space. The original, 
noisy, and denoised Lena images are shown in Fig. 17.2. 
Figure 17.2 Denoising Gaussian noise (Upper): (a) original image, (b) input noisy image, (c) PCA
(r = 20), and (d) KHA (r = 40). Denoising salt and pepper type noise (Lower): (a) original image,
(b) input noisy image, (c) PCA (r = 20), and (d) KHA (r = 20). ©IEEE, 2005 [53].


17.3 Kernel LDA


Kernel LDA consists of a two-stage procedure. The first step consists of embed- 
ding the data space X into a possibly infinite-dimensional RKHS F via a kernel 
function. The second step simply applies LDA in this new data space. Like LDA, 
kernel LDA always encounters the ill-posed problem, and some form of regular- 
ization needs to be included. Kernel LDA can be formulated for two classes. The 
inner product matrix can be made nonsingular by adding a scalar matrix [76]. 
By introducing kernels into the linear $w$, nonlinear discriminant analysis is obtained
[76, 77, 114]. A multiple of the identity or the kernel matrix can be added to $S_w$
(or its reformulated form $\tilde{S}_w$) after introducing the kernels to penalize $\|w\|^2$ (or
$\|\tilde{w}\|^2$) [76].
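
The two-class case with identity regularization can be sketched as below. This is a minimal illustration under stated assumptions, not the formulation of any one cited paper: the function names `rbf` and `kernel_fda`, the Gaussian kernel width, the regularization constant `mu`, and the concentric-ring test data are all chosen here for demonstration. The discriminant is expanded as $w = \sum_i \alpha_i \phi(x_i)$, and the coefficient vector is obtained by solving a linear system in which a multiple of the identity is added to the kernelized within-class scatter matrix.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_fda(X, labels, kernel, mu=1e-3):
    """Two-class kernel Fisher discriminant; labels must take values in {0, 1}."""
    K = kernel(X, X)
    N = K.shape[0]
    within = np.zeros((N, N))       # kernelized within-class scatter
    means = []
    for c in (0, 1):
        Kc = K[:, labels == c]      # N x n_c block of kernel columns for class c
        n_c = Kc.shape[1]
        means.append(Kc.mean(axis=1))
        within += Kc @ (np.eye(n_c) - np.ones((n_c, n_c)) / n_c) @ Kc.T
    # Regularize by adding a multiple of the identity before solving.
    alpha = np.linalg.solve(within + mu * np.eye(N), means[1] - means[0])
    return alpha, K

# Two concentric rings: not linearly separable in the input space.
rng = np.random.default_rng(0)
n = 60
theta = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
radius = np.concatenate([np.full(n, 1.0), np.full(n, 3.0)]) + rng.normal(0.0, 0.1, 2 * n)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
labels = np.repeat([0, 1], n)

alpha, K = kernel_fda(X, labels, rbf)
z = K @ alpha                                   # one-dimensional projections
threshold = 0.5 * (z[labels == 0].mean() + z[labels == 1].mean())
accuracy = ((z > threshold).astype(int) == labels).mean()
```

Without the `mu * np.eye(N)` term the linear system is typically singular or badly conditioned, which is the ill-posedness mentioned above.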

Hard margin linear SVM is equivalent to LDA when all the training points 
are support vectors [95]. SVM and kernel LDA usually have similar performance. 
In [89], the vector obtained by kernel QP feature selection is equivalent to the 
kernel Fisher vector and therefore, an interpretation of kernel LDA is given which 
provides some computational advantages for highly unbalanced data sets. 

Kernel LDA for multiple classes is formulated in [10]. The method employs 
QR decomposition to avoid singularity. Kernel LDA is much more effective than 
kernel PCA in face recognition [71]. Kernel direct discriminant analysis [71]
generalizes direct LDA. Based on the kernel PCA plus LDA framework, complete
kernel LDA [119] can be used to carry out discriminant analysis in double
discriminant subspaces. It can make full use of regular and irregular discriminant
information. Complete kernel LDA outperforms other kernel LDA algorithms.

Kernel quadratic discriminant [84] is based on the regularized kernel Maha- 
lanobis distance in both complete and class-related subspaces. The method can 
be advantageous for data with unequal class spreads in the kernel-induced spaces. 

The properties of kernel uncorrelated discriminant analysis and kernel regular- 
ized discriminant analysis are studied in [52]. Under a mild condition, both LDA 
and kernel uncorrelated discriminant analysis project samples in the same class 
to a common vector in the dimension-reduced space. This implies that uncorre- 
lated discriminant analysis may suffer from the overfitting problem if there are a 
large number of samples in each class. Regularization can be applied to overcome 
the overfitting problem. 

Robust kernel fuzzy discriminant analysis [46] uses fuzzy memberships to 
reduce the effect of outliers and adopts kernel methods to accommodate non- 
linearly separable cases. In [109], a scatter-matrix based class separability crite- 
rion is extended to a kernel space and developed as a feature selection criterion. 
Compared with the radius-margin bound, feature selection with this criterion is 
faster. The proposed criterion is more robust in the case of small sample set and 
is less vulnerable to noisy features. This criterion is proven to be a lower bound 
of the maximum value of kernel LDA’s objective function. 

The common vector method is a linear subspace classifier method which allows 
one to discriminate between classes of data sets. This method utilizes subspaces 
that represent classes during classification. The kernel discriminant common
vector method [20] yields an optimal solution for maximizing a modified LDA
criterion. A modified common vector method and its kernelized version are given
in [21]. Under certain conditions, a 100% recognition rate is guaranteed for the
training set samples.

The kernel trick is applied to the linear-domain Foley-Sammon discriminant
vectors, which are optimal with respect to orthogonality constraints, resulting in
a nonlinear LDA method [130]. The kernel Foley-Sammon method may suffer
from a heavy computational load due to matrix inversion, resulting
in cubic complexity for each discriminant vector. A fast algorithm for solving
this kernelized model [132] is based on rank-one updates of the eigensystems
and the QR decomposition of matrices, to incrementally establish the
eigensystems for the discriminant vectors. It requires only quadratic complexity for each
discriminant vector. To further reduce the complexity, a kernel Gram-Schmidt
orthogonalization method [113] is adopted to replace kernel PCA in the
preprocessing stage.

A criterion is derived in [125] for finding a kernel representation where the 
Bayes classifier becomes linear. It maps the original class (or subclass) distribu- 
tions into a kernel space where these are best separated by a hyperplane. The 
approach aims to maximize the distance between the distributions of different 
classes, thus maximizing generalization. It is applied to LDA, nonparametric
discriminant analysis and subclass discriminant analysis. A kernel version of
subclass discriminant analysis yields the highest recognition rates [125]. 

Leave-one-out crossvalidation for kernel LDA [17] is extended in [91] such that 
the leave-one-out error can be re-estimated following a change in the regulariza- 
tion parameter in kernel LDA. This reduces the computational complexity from 
$O(N^3)$ to $O(N^2)$ operations for $N$ training patterns. The method is competitive
with model selection based on k-fold crossvalidation in terms of generalization, 
while being considerably faster. In [110], by following the principle of maximum 
information preservation, model selection in kernel LDA is formulated as select- 
ing an optimal kernel-induced space in which different classes are maximally 
separated from one another. The kernel parameters are tuned by maximizing 
a scatter-matrix based criterion. Compared with crossvalidation, this approach 
achieves faster model selection, especially when the number of training samples 
is large or when many kernel parameters need to be tuned. 


17.4 Kernel clustering


Kernel-based clustering first nonlinearly maps the patterns into an arbitrarily 
high-dimensional feature space, and then performs clustering in the feature space. 
Some examples are kernel C-means [93], kernel subtractive clustering [54], a 
kernel-based algorithm that minimizes the trace of the within-class scatter matrix 
[39], and support vector clustering [16]. Support vector clustering can effectively 
deal with the outliers. 

C-means can only find linearly separable clusters. Kernel C-means [93] iden- 
tifies nonlinearly separable clusters; it monotonically converges if the kernel 
matrix is positive semidefinite. If the kernel matrix is not positive semidefinite, 
the convergence of the algorithm is not guaranteed. Kernel C-means requires 
$O(N^2 \tau)$ scalar operations, where $\tau$ is the number of iterations until convergence
is achieved. Performing C-means in the kernel PCA space is equivalent to kernel 
C-means [64]. Weighted kernel C-means [29], which assigns different weights to 
each data point, is closely related to graph partitioning as its objective becomes 
equivalent to many graph cut criteria if the weights and kernel are set
appropriately. In a kernel method for batch clustering inspired by C-means [16], each
cluster is iteratively refined using a one-class SVM. Multilevel kernel C-means
clustering [30] does not require computing the whole kernel matrix on the
training data set, and thus is extremely efficient. Global kernel C-means [107] adds
one cluster at each stage, through a global search procedure consisting of several
executions of kernel C-means from suitable initializations. Fast global kernel
C-means and global kernel C-means with convex mixture models are proposed to
speed up the procedure.
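
A minimal sketch of batch kernel C-means is given below, assuming a precomputed positive semidefinite kernel matrix. The function name `kernel_kmeans`, the random initialization, and the two-blob test data are illustrative choices made here, not taken from [93]. The squared feature-space distance from sample $i$ to the mean of cluster $c$ is computed purely from kernel values, and the recorded cost is non-increasing across iterations when the kernel matrix is positive semidefinite. Each iteration costs $O(N^2)$ operations, in line with the $O(N^2 \tau)$ total stated above.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Batch kernel C-means on a precomputed kernel matrix K (N x N)."""
    N = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=N)   # random initial assignment
    diagK = np.diag(K)
    history = []                                # cost after each assignment step
    for _ in range(n_iter):
        dist = np.empty((N, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            n_c = mask.sum()
            if n_c == 0:
                dist[:, c] = np.inf             # an empty cluster attracts no points
                continue
            # ||phi(x_i) - m_c||^2 expressed with kernel values only.
            dist[:, c] = (diagK
                          - 2.0 * K[:, mask].sum(axis=1) / n_c
                          + K[np.ix_(mask, mask)].sum() / n_c**2)
        new_labels = dist.argmin(axis=1)
        history.append(dist[np.arange(N), new_labels].sum())
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, history

# Illustrative data: two Gaussian blobs with a Gaussian kernel.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (40, 2)), rng.normal(4.0, 0.5, (40, 2))])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-d2 / 2.0)

labels, history = kernel_kmeans(K, n_clusters=2)
```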

Kernelized FCM [128] replaces the Euclidean distance with a kernel function,
and it outperforms FCM. Kernel C-means and kernel FCM perform similarly
in terms of classification quality when using a Gaussian kernel function and
generally perform better than their standard counterparts [55]. Generic FCM
and Gustafson-Kessel FCM are compared with two typical generalizations of
kernel-based fuzzy clustering in [40]. The kernel-based FCM algorithms produce
a marginal improvement over FCM and Gustafson-Kessel for most of the data
sets. However, in a number of cases they are highly sensitive to the selection of
specific values of the kernel parameters.

A kernel-induced distance is used to replace the Euclidean distance in the 
potential function [54] of the subtractive clustering method. This enables cluster- 
ing of the data that is linearly inseparable in the original space into homogeneous 
groups in the transformed high-dimensional space, where the data separability 
is increased. 

A kernel version of SOM given in [73] is derived from kernelizing C-means 
with added neighbourhood learning, based on the distance kernel trick. In the 
self-organising mixture network [123], neurons in SOM are treated as Gaussian 
(or other) kernels, and the resulting map approximates a mixture of Gaussian (or 
other) distributions of the data. Kernel SOM can be derived naturally by mini- 
mizing an energy function, and the resulting kernel SOM unifies the approaches 
to kernelize SOM and can be performed entirely in the feature space [63]. SOM 
approximates the kernel method naturally, and further kernelizing SOM may not 
be necessary: there is no clear evidence showing that kernel SOMs are always 
superior to SOM [63]. 

The kernel-based maximum entropy learning rule (kMER) [108] is an approach
that formulates an equiprobabilistic map using maximum entropy learning to
avoid the underutilization problem for clustering. The kMER approach updates
the prototype vectors and the corresponding radii of the kernels centered on these
vectors to model the input density at convergence. kMER considers several
winners at a time. This leads to computational inefficiency and a slow formation
of the topographic map. SOM-kMER [104] integrates SOM and kMER. It
allocates a kernel at each neuron which is a prototype of a cluster of input samples.
Probabilistic SOM-kMER [105] utilizes SOM-kMER and the probabilistic neural
network for classification problems. Instead of using all the training samples, the
prototype vectors from each class of the trained SOM-kMER map are used for
nonparametric density estimation.

The kernel version of neural gas [87] applies the soft rule for the update to the 
codevectors in feature space. An online version of kernel C-means can be found 
in [93]. Other methods are kernel possibilistic C-means [33] and kernel spectral 
clustering [6]. 

Self-adaptive kernel machine [14] is an online clustering algorithm for evolving 
clusters from non-stationary data, based on SVM and the kernel method. The 
algorithm designs an unsupervised feedforward network architecture with self- 
adaptive abilities. 
Kernel autoassociators, kernel CCA and kernel ICA


Kernel autoassociators 

Kernel associative memory introduces the kernel approach to associative mem- 
ory by nonlinearly mapping the data into some high-dimensional feature space 
through operating a kernel function with input space. Kernel autoassociators 
are generic one-class learning machines. They perform autoassociation mapping 
via the kernel feature space. They can be applied to novelty detection and m- 
class classification problems. Kernel autoassociators have the same expression 
as that of the kernel associative memory [129]. Thus, kernel associative memo- 
ries are a special form of kernel autoassociators. In [38], the update equation of 
the dynamic associative memory is interpreted as a classification step. A model 
of recurrent kernel associative memory [38] is analyzed in [85], and this model 
consists in a kernelization of RCAM [85]. 


Kernel CCA 
If there is nonlinear correlation between two variables, CCA may not
correctly capture this relationship. Kernel CCA first maps the data into a
high-dimensional feature space induced by a kernel and then performs CCA. In this
way, nonlinear relationships can be found [8], [60]. Kernel CCA has been used 
in a kernel ICA algorithm [8]. 

Given two random variables $X$ and $Y$, CCA looks for linear mappings $a^T X$ and
$b^T Y$ that achieve maximum correlation. The purpose of kernel CCA is to provide
nonlinear mappings $f(X)$ and $g(Y)$, where $f$ and $g$ belong to the respective
RKHSs $H_X$ and $H_Y$, i.e., $f \in H_X$ and $g \in H_Y$, such that their correlation is
maximized:

$$\max_{f \in H_X,\, g \in H_Y,\, f \neq 0,\, g \neq 0} \frac{\mathrm{cov}[f(X), g(Y)]}{\left(\mathrm{var}[f(X)]\right)^{1/2} \left(\mathrm{var}[g(Y)]\right)^{1/2}}. \qquad (17.12)$$


Kernel CCA [8], [60] tackles the singularity problem of the Gram matrix by 
simply adding a regularization term to the Gram matrix. A mathematical proof 
of the statistical convergence of kernel CCA is given in [35]. The result also gives 
a sufficient condition for convergence on the regularization coefficient involved in 
kernel CCA. An improved kernel CCA algorithm based on EVD approach rather 
than the regularization method is given in [131]. 
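
The regularized approach can be illustrated with a short sketch. It uses one common regularized formulation, in which the coefficient vector $\alpha$ solves $(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \rho^2 \alpha$ for centered Gram matrices $K_x$ and $K_y$; this particular form, the value of $\kappa$, the kernel widths, and the helper names are assumptions made here and are not necessarily those used in [8], [60]. The data have a quadratic dependence, so the linear correlation is close to zero while the kernel canonical correlation is high.

```python
import numpy as np

def rbf_gram(v, sigma):
    # Gaussian Gram matrix for one-dimensional samples v.
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def center(K):
    # Center the Gram matrix in feature space.
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def kernel_cca(Kx, Ky, kappa):
    """Leading pair of kernel canonical directions (regularized form, see lead-in)."""
    N = Kx.shape[0]
    I = np.eye(N)
    M = np.linalg.solve(Kx + kappa * I, Ky) @ np.linalg.solve(Ky + kappa * I, Kx)
    w, V = np.linalg.eig(M)                 # eigenvalues are real in exact arithmetic
    alpha = V[:, np.argmax(w.real)].real
    beta = np.linalg.solve(Ky + kappa * I, Kx @ alpha)
    return alpha, beta

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = x**2 + rng.normal(0.0, 0.05, 200)       # nonlinear dependence between x and y

Kx = center(rbf_gram(x, 0.5))
Ky = center(rbf_gram(y, 0.5))
alpha, beta = kernel_cca(Kx, Ky, kappa=0.1)

f, g = Kx @ alpha, Ky @ beta                # f(x_i) and g(y_i) on the training set
linear_corr = np.corrcoef(x, y)[0, 1]
kernel_corr = np.corrcoef(f, g)[0, 1]
```

With $\kappa = 0$ the problem degenerates, since a full-rank Gram matrix permits perfect in-sample correlation for arbitrary data; this is why some form of regularization is needed.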

Least-squares weighted kernel reduced rank regression (LS-WKRRR) [27] is 
a unified framework for formulating many component analysis methods such as 
PCA, LDA, CCA, locality preserving projections, and spectral clustering, and 
their kernel and regularized extensions. 


Kernel ICA 

A class of kernel ICA algorithms [8] use contrast functions based on canonical 
correlations in an RKHS. In [4], the proposed kernel-based contrast function 
for ICA corresponds to a regularized correlation measure in high-dimensional 
feature spaces induced by kernels. The formulation is a multivariate extension
of the LS-SVM formulation to kernel CCA. The kernel-based nonlinear BSS 
method [75] exploits second-order statistics in a kernel-induced feature space. 
This method extends a linear algorithm to the nonlinear domain using the kernel- 
based method of extracting nonlinear features applied in SVM and kernel PCA. 
kTDSEP [43] combines kernel feature spaces and second-order temporal decor- 
relation BSS using temporal information. 


Other kernel methods 


Conventional online linear algorithms have been extended to their kernel ver- 
sions: kernel LMS algorithm [86], [69], kernel adaline [34], kernel RLS [32], ker- 
nel Wiener filter [124], kernel discriminant NMF [65], complex kernel LMS [11], 
kernel LS [90], and kernel online learning [58]. 

The kernel ridge regression classifier is provided by the generalized kernel 
machine toolbox [18]. The constrained covariance and the kernel mutual infor- 
mation [41] measure independence based on the covariance between functions 
of the random variables in RKHSs. Two generalizations of NMF in kernel fea- 
ture space are polynomial kernel NMF [15] and projected gradient kernel NMF 
(PGKNMF) [126]. An RKHS framework for spike trains is introduced in [82]. 

A common problem of kernel-based online algorithms is the amount of memory 
required to store the online hypothesis, which may increase without bound as the 
algorithm progresses. Furthermore, the computational load of such algorithms 
grows linearly with the amount of memory used to store the hypothesis. 

Kernel density estimation is a nonparametric method that yields a pdf, given 
a set of observations. The resulting pdf is the sum of kernel functions centered 
in the data points. The computational complexity of kernel density estimation 
makes its application to data streams impossible. The cluster kernel approach 
[44] provides continuously computed kernel density estimators over streaming 
data. 
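
As a small illustration of the batch estimator described above, the following sketch builds a one-dimensional Gaussian kernel density estimate. The function name `gaussian_kde_1d`, the bandwidth `h`, and the synthetic two-component sample are assumptions made for this example. Every evaluation sums over all stored samples, which shows why the plain estimator does not scale to unbounded data streams.

```python
import numpy as np

def gaussian_kde_1d(samples, h):
    """Return a function evaluating the Gaussian kernel density estimate."""
    samples = np.asarray(samples, dtype=float)
    norm = len(samples) * h * np.sqrt(2.0 * np.pi)

    def pdf(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        z = (t[:, None] - samples[None, :]) / h
        # One Gaussian bump per data point, summed and normalized.
        return np.exp(-0.5 * z**2).sum(axis=1) / norm

    return pdf

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)])
pdf = gaussian_kde_1d(data, h=0.3)

grid = np.linspace(-8.0, 8.0, 2001)
density = pdf(grid)
area = density.sum() * (grid[1] - grid[0])   # numerical check that the pdf integrates to about 1
```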

Based on the perceptron algorithm, projectron [81] projects the instances onto 
the space spanned by the previous online hypothesis. Projectron++ is deduced 
based on the notion of large margin. Both projectron and projectron++ are 
compared to the perceptron, forgetron [26] and randomized budget perceptron 
[19] algorithms, by using the DOGMA library. The performance of the projectron 
algorithm is slightly worse than, but very similar to, that of the perceptron 
algorithm, for a wide range of the learning rate. Projectron++ outperforms 
perceptron, projectron, forgetron and randomized budget perceptron, with a 
similar hypothesis size. For a given target accuracy, the size of the support sets 
of projectron or projectron++ are much smaller than those of forgetron and 
randomized budget perceptron. 

Many kernel classifiers do not consider the data distribution and have difficulty
outputting probabilities or confidences for classification. Kernel-based MAP
classification [117] makes a Gaussian distribution assumption instead of a linearly
separable assumption in the feature space. Robust methods are proposed to 
estimate the probability densities, and the kernel trick is utilized to calculate 
the model. The model can output probability or confidence for classification. 

A sparsity-driven kernel classifier based on the minimization of a data- 
dependent generalization error bound is considered in [83]. The objective function 
consists of the usual hinge loss function penalizing training errors and a concave 
penalty function of the expansion coefficients. The problem of minimizing the 
non-convex bound is addressed by a successive linearization approach, whereby 
the problem is transformed into a sequence of linear programs. The algorithm 
produces error rates comparable to SVM but significantly reduces the number 
of support vectors. The kernel classifier in [57] optimizes the $L_2$ or integrated
squared error of a difference of densities. Like SVM, this classifier is sparse and 
results from solving a quadratic program. The method allows data-adaptive ker- 
nels, and does not require an independent sample. 

Kernel logistic regression corresponds to the penalized logistic classification in 
an RKHS [50], [133]. It is easier to analyze than SVM because it has the logistic 
loss function and SVM has a hinge loss function. The computational cost for 
kernel logistic regression is much higher than that for SVMs with decomposition 
algorithms. A decomposition algorithm for the kernel logistic regression is intro- 
duced in [133]. In kernel logistic regression, the kernel expansion is non-sparse 
in the data. 

In the finite training data case, kernel LMS [69] is well-posed in RKHS
without the addition of a regularization term. Kernel affine projection algorithms
[68] combine the kernel trick and affine projection algorithms. They inherit the 
simplicity and online nature of kernel LMS while reducing its gradient noise 
and boosting performance. Kernel affine projection provides a unifying model 
for kernel LMS, kernel adaline, sliding-window kernel RLS, and regularization 
networks. 
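
The basic kernel LMS recursion can be sketched as follows. This is a minimal illustration, not the implementation of [69]: the class name `KernelLMS`, the Gaussian kernel width, the step size, and the test signal are assumptions made here. Each sample adds one kernel unit centered at the input with coefficient equal to the step size times the prediction error, so the stored expansion grows with every sample, which is the memory problem of kernel-based online algorithms noted earlier in this section.

```python
import numpy as np

class KernelLMS:
    """Kernel LMS with a Gaussian kernel (illustrative sketch)."""

    def __init__(self, eta=0.2, sigma=0.3):
        self.eta = eta          # step size
        self.sigma = sigma      # Gaussian kernel width
        self.centers = []       # stored inputs, one per processed sample
        self.coeffs = []        # expansion coefficients

    def predict(self, x):
        if not self.centers:
            return 0.0
        c = np.asarray(self.centers)
        k = np.exp(-np.sum((c - x) ** 2, axis=1) / (2.0 * self.sigma**2))
        return float(np.dot(self.coeffs, k))

    def update(self, x, d):
        # Prediction error before the update, then grow the expansion by one unit.
        e = d - self.predict(x)
        self.centers.append(np.array(x, dtype=float))
        self.coeffs.append(self.eta * e)
        return e

# Online learning of a nonlinear map from noisy samples.
rng = np.random.default_rng(0)
model = KernelLMS()
errors = []
for _ in range(500):
    x = rng.uniform(-1.0, 1.0, size=1)
    d = np.sin(3.0 * x[0]) + rng.normal(0.0, 0.05)
    errors.append(model.update(x, d))
errors = np.array(errors)

mse_first = np.mean(errors[:100] ** 2)
mse_last = np.mean(errors[-100:] ** 2)
```

Budgeted variants would bound `len(model.centers)` by discarding or merging units; none of that is attempted here.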

A reformulation of the sampling theorem using an RKHS, presented by Nashed 
and Walter [78], is an important milestone in the history of the sampling theorem. 
The formulation gives us a unified viewpoint for many generalizations and exten- 
sions of the sampling theorem. In [79], a framework of the optimal approximation, 
rather than a perfect reconstruction, of a function in the RKHS is introduced by 
using the orthogonal projection onto the linear subspace spanned by the given 
system of kernel functions corresponding to a finite number of sampling points. 
This framework is extended to infinite sampling points in [103]. 

In [37], learning kernels under the LASSO formulation is implemented by
adopting a generative Bayesian learning and inference approach. The proposed
robust learning algorithm produces a sparse kernel model with the capability of
learning regularized parameters and kernel hyperparameters. A comparison with
sparse regression models such as RVM and the local regularization assisted OLS
regression shows considerable computational advantages.
Bounds on expected leave-one-out crossvalidation errors for kernel methods are
derived in [127], which lead to expected generalization bounds for various kernel 
algorithms. In addition, variance bounds for leave-one-out errors are obtained. 
In [74], prior knowledge over arbitrary general sets is incorporated into nonlin- 
ear kernel approximation problems in the form of linear constraints in a linear 
program. 


Multiple kernel learning 


Multiple kernel learning (MKL) [62], [80] considers multiple kernels or a
combination of kernels rather than a single fixed kernel. It tries to form an ensemble
of kernels that fits a given application. MKL can offer some needed
flexibility and can handle well the case that involves multiple, heterogeneous data
sources.

In [62], MKL was formulated as a semi-definite programming (SDP) problem. 
A convex quadratically constrained quadratic program (QCQP) is constructed 
by the conic combinations of multiple kernels $k = \sum_i a_i k_i$ from a library of
candidate kernels $k_i$ [62]. In order to extend the method to large-scale problems,
QCQP is reconstructed as a semi-infinite linear program that recycles the SVM 
implementations [98]. 
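
The conic combination itself is easy to illustrate. The sketch below forms a weighted sum of Gram matrices with nonnegative weights; the helper names, the candidate kernels, and the fixed weights are assumptions for demonstration, and no weight learning (SDP, QCQP or semi-infinite LP) is performed. Since each candidate Gram matrix is positive semidefinite and the weights are nonnegative, the combined matrix is again a valid kernel matrix.

```python
import numpy as np

def rbf_gram(X, sigma):
    # Gaussian Gram matrix for the rows of X.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def poly_gram(X, degree):
    # Inhomogeneous polynomial Gram matrix.
    return (X @ X.T + 1.0) ** degree

def combine_kernels(grams, weights):
    """Conic combination sum_i a_i K_i with all a_i >= 0."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("conic combination requires nonnegative weights")
    return sum(w * K for w, K in zip(weights, grams))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
library = [rbf_gram(X, 0.5), rbf_gram(X, 2.0), poly_gram(X, 2)]   # candidate kernels
K = combine_kernels(library, [0.6, 0.3, 0.1])                     # weights fixed by hand here

eigs = np.linalg.eigvalsh(K)   # all eigenvalues should be nonnegative up to round-off
```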

For regularized kernel discriminant analysis, the optimal kernel matrix is 
obtained as a linear combination of pre-specified kernel matrices [122]. The kernel 
learning problem can be formulated as a convex SDP. A convex QCQP formula- 
tion is proposed for binary-class kernel learning in regularized kernel discriminant 
analysis, and the QCQP formulations are solved using the MOSEK interior point 
optimizer for LP [122]. Multi-class regularized kernel discriminant analysis can 
be decomposed into a set of binary-class kernel learning problems which are con- 
strained to share a common kernel; SDP formulations are then proposed, which 
lead naturally to QCQP and semi-infinite LP formulations. 

Sparse MKL [99] generalizes group feature selection to kernel selection. It is 
capable of exploiting existing efficient single kernel algorithms while providing 
a sparser solution in terms of the number of kernels used as compared to the 
existing MKL framework. 

Most MKL methods employ $L_1$-norm simplex constraints on the kernel
combination weights, which therefore yield a sparse but non-smooth solution
for the kernel weights. Despite their efficiency, they tend to discard
informative complementary or orthogonal base kernels and yield degraded
generalization performance. To tackle these problems, a generalized MKL model
that introduces an elastic-net-type constraint on the kernel weights has
been proposed in [121]; it enjoys the favorable sparsity property of the
solution and also facilitates the grouping effect. This yields a convex optimization
problem.
SimpleMKL [88] performs a reduced gradient descent on the kernel weights.
HessianMKL [23] replaces the gradient descent update of SimpleMKL with a 
Newton update. At each iteration, HessianMKL solves a QP problem with the 
size of the number of kernels to obtain the Newton update direction. HessianMKL 
shows second-order convergence. 

For $L_p$-norm MKL, it is shown that in a non-sparse scenario $L_p$-norm MKL yields
strictly better bounds than $L_1$-norm MKL, and vice versa [59], [1]. The
analytical solver for non-sparse MKL is compared with some $L_1$-norm MKL methods,
namely, SimpleMKL [88], HessianMKL, SILP-based wrapper, and SILP-based 
chunking optimization [98]. SimpleMKL and the analytical solver become more 
efficient with increasing number of kernels, but the capacity remains limited due 
to memory restriction. HessianMKL is considerably faster than SimpleMKL but 
slower than the non-sparse interleaved methods and SILP. Overall, the inter- 
leaved analytic and cutting plane based optimization strategies [59] achieve a 
speedup of up to one and two orders of magnitude over HessianMKL and Sim- 
pleMKL, respectively. 

SpicyMKL [102] is applicable to general convex loss functions and general types 
of regularization. By iteratively solving smooth minimization problems, there is 
no need of solving SVM, LP or QP internally. SpicyMKL can be viewed as a 
proximal minimization method and converges super-linearly. The cost of inner 
minimization is roughly proportional to the number of active kernels. SpicyMKL
is faster than existing methods especially when the number of kernels is large 
(> 1000). 

In [115], a soft margin framework for MKL is proposed by introducing kernel 
slack variables. The commonly used hinge loss, square hinge loss, and square loss 
functions can be incorporated into this framework. Many existing MKL methods 
can be shown to be special cases under the soft margin framework. $L_1$-norm MKL
can be deemed as hard margin MKL. The proposed algorithms can efficiently 
achieve an effective yet sparse solution for MKL. 

In [56], a sparsity-inducing multiple kernel LDA is introduced, where an $L_1$
norm is used to regularize the kernel weights. This optimal kernel selection 
problem can be reformulated as a tractable convex optimization problem which 
interior-point methods can solve globally and efficiently. Multiple kernel FCM 
[49] extends the FCM algorithm with a multiple kernel-learning setting. 


17.1 Show that Mercer kernels are positive definite. 


17.2 Synthetically generate 2-dimensional data which lie on a circle and are
corrupted by Gaussian noise. Use STPRtool
(http://cmp.felk.cvut.cz/cmp/software/stprtool1/) to find the first principal
components of the data by using PCA and the kernel PCA model with an RBF kernel.
17.3 Face image superresolution and denoising [53]. Apply the kernel
PCA method on a face image for image superresolution purpose and for denois- 
ing purpose [53]. The Yale Face Database B is used for the experiments. The 
database contains 5,760 images of 10 persons, down-sampled to 60 x 60 pixels. 
Five thousand images are used for training and the remaining are used for test- 
ing. Kernel Hebbian algorithm is applied during the training and 16 eigenvectors 
are obtained. Ten test samples are randomly selected from the test set. 

(a) Downsample the test samples to 20 x 20 resolution and then resize them to 
60 x 60 resolution by mapping each pixel to a 3 x 3 block of identical pixel
values. Project the processed test samples to the obtained eigenvectors. Reconstruct
the test samples by finding the closest samples in the training set. 

(b) Project the test samples directly to the obtained eigenvectors. Reconstruct 
the test samples from the projections. 


17.4 Superresolution of a natural image [53]. Apply the kernel PCA 
method on a natural image of low resolution to get a superresolution image 
[53]. 


17.5 Consider an artificial 4 × 4 checkerboard data set based on a uniform
distribution. Use kernel LDA to separate the samples. Plot the separating boundary.


Reinforcement learning


18.1 Introduction


Reinforcement learning has its origin in the psychology of animal learning. It
rewards the learner (agent) for correct actions and punishes it for wrong ones.
In the mammalian brain, learning by reinforcement is a function of brain nuclei
known as the basal ganglia. The basal ganglia use this reward-related
information to modulate sensory-motor pathways so as to render future behaviors
more rewarding [16].

Reinforcement learning is a type of machine learning in which an agent (e.g., a
real or simulated robot) seeks an effective policy for solving a sequential
decision task. Such a policy dictates how the agent should behave in each state
it may encounter in order to maximize the total expected reward (or minimize
punishment) by trial-and-error interaction with a dynamic environment [1], [22].
There is no need to specify how the task is to be achieved. The difference
between the predicted reward and the reward actually received, termed the
reward-prediction error, has been shown to correlate very well with the phasic
activity of dopamine-releasing neurons projecting from the substantia nigra in
non-human primates [19].

The reinforcement learning problem is defined by three features, namely, agent- 
environment interface, function for evaluative feedback, and Markov property of 
the learning process. The agent is connected to its environment via sensors. An 
agent acts in an unknown or partly known environment with the goal of maxi- 
mizing an external reward signal. This is in keeping with the learning situations 
an animal encounters in the real world, where there is no supervision but rewards 
and penalties such as hunger, satiety, pain and pleasure abound. 

Reinforcement learning can be viewed as a special case of supervised learning
in which the exact desired output is unknown. A learner must explicitly explore
its environment. The teacher supplies only feedback about the success or
failure of an answer. This is cognitively more plausible than fully supervised
learning, since a fully specified correct answer might not always be available
to the learner or even the teacher. Learning is based only on evaluative
feedback, which can be noisy or sparse: the information as to whether or not
the actual output is close to the estimate. Reinforcement learning thus rewards
the agent for a good output and punishes it for a bad one. Explicit computation
of derivatives is not required. This, however, makes learning slower. For a
control system, if the controller still works properly after an input, the
output is judged as good; otherwise, it is considered bad. This binary
evaluation of the output, called external reinforcement, is used as the error
signal.

When the basic elements of decision optimization (states, actions and
reinforcements) are considered from a system-theoretic perspective, the
reinforcement learning model can be interpreted as a decision tree. The
objective of reinforcement learning is then to find a path through the decision
tree which maximizes the sum of rewards. Reinforcement learning is a practical
tool for solving sequential decision problems that can be modeled as Markov 
decision problems. It is among the most general frameworks of learning control 
to create truly autonomous learning systems. Reinforcement learning is widely 
used in robot control and artificial intelligence. It has influenced a number of 
fields, including operations research, cognitive science, optimal control, psychol- 
ogy, neuroscience and others. 

We now mention in passing the difference between off-line and online imple- 
mentations. In an off-line implementation, the algorithm runs on the simulator 
before being implemented on the real system. On the other hand, in an online 
implementation, the algorithm runs on a real-time basis in the real system. 


Reinforcement learning vs. dynamic programming 

Dynamic programming is a mathematical method for finding an optimal control 
and its solution using a value function in a dynamic system. It is guaranteed to 
give optimal solutions to Markov decision problems (MDPs) and semi-Markov 
decision problems. The two main algorithms of dynamic programming, value iter- 
ation and policy iteration, are based on the Bellman equation, which contains the 
elements of the value function as the unknowns. In dynamic programming, the 
transition probability and the transition reward matrices are first generated, and 
these matrices are then used to generate a solution. All dynamic programming 
algorithms are model-based. The mechanism of model building is to construct the 
transition probability model in a simulator by straightforward counting. Imple- 
mentation of dynamic programming is often a difficult and tedious process that 
involves a theoretical model. 
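
As an illustration of the model-based nature of dynamic programming, the
following Python sketch runs value iteration on a tiny MDP whose transition
probabilities and rewards are given in advance. The two-state model `P`, the
constant `GAMMA` and the helper names are made up for this example and are not
taken from the text.

```python
# Value iteration on a tiny, made-up MDP (all numbers are illustrative).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # try to move to state 1
    1: {0: [(1.0, 1, 2.0)],                  # stay in state 1, reward 2
        1: [(1.0, 0, 0.0)]},                 # go back to state 0
}
GAMMA = 0.9  # discount factor


def q_value(V, s, a):
    """Expected one-step return of action a in state s under values V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy


V, policy = value_iteration()
print(V, policy)
```

The whole model `P` must be available before the first backup is computed,
which is exactly the requirement that simulation-based methods try to avoid.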

Reinforcement learning is essentially a form of simulation-based dynamic pro- 
gramming and is primarily used to solve Markov and semi-Markov decision prob- 
lems. As such, it is often called a heuristic dynamic programming technique. 
The environment can be modeled as a finite Markov decision process where the 
goal of the agent is to obtain near-optimal discounted return. In reinforcement 
learning, we do not estimate the transition probability or the reward matrices; 
instead we simulate the system using the distributions of the governing random 
variables. It can learn the system structure by trial-and-error, and is suitable 
for online learning. Most reinforcement learning algorithms calculate the value 
function of dynamic programming. The value function is stored in the form of 
the so-called Q-values. Most reinforcement learning algorithms are based on the 
Q-value version of the Bellman equation. An algorithm that does not use
transition probabilities in its updating equations is called a model-free
algorithm. A suitable reinforcement learning algorithm can obtain a
near-optimal solution.

Figure 18.1 Principle of reinforcement learning.

Reinforcement learning can also avoid the computational burden encoun- 
tered by dynamic programming. A major difficulty associated with the Bellman 
equation-based approaches is the curse of dimensionality. Consider a problem 
with a million state-action pairs. Using model-free algorithms, one can avoid 
storing the huge transition probability matrices. Nonetheless, one must still find 
some way to store the one million Q-values. Function approximation is a strategy 
for reducing the storage; it can be done by state aggregation, function fitting, 
and function interpolation. 
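
State aggregation, the simplest of these strategies, can be sketched in Python
as follows. The example assumes a one-dimensional continuous state in [0, 1];
the bucket count `N_BUCKETS` and the functions `aggregate` and `update` are
hypothetical names chosen for illustration.

```python
from collections import defaultdict

N_BUCKETS = 10  # hypothetical resolution of the aggregation


def aggregate(state, low=0.0, high=1.0):
    """Map a continuous state in [low, high] to one of N_BUCKETS cells."""
    frac = (state - low) / (high - low)
    return min(N_BUCKETS - 1, max(0, int(frac * N_BUCKETS)))


# One stored value per (bucket, action) pair that has actually been visited.
Q = defaultdict(float)


def update(state, action, target, eta=0.1):
    """Move the aggregated Q value a fraction eta towards the target."""
    key = (aggregate(state), action)
    Q[key] += eta * (target - Q[key])


update(0.123, "left", 1.0)
update(0.177, "left", 1.0)   # same bucket as 0.123, so the same entry is refined
update(0.950, "left", 0.0)
print(dict(Q))
```

All states that fall into the same bucket share one table entry, so the storage
grows with the number of buckets rather than with the number of distinct states.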


18.2 Learning through awards


Reinforcement learning is illustrated in Fig. 18.1. It learns a mapping from sit- 
uation to actions by maximizing the scalar reward or reinforcement signal, fed 
back from the environment or an external evaluator. An agent receives sensory 
inputs (as the state of the environment) from its environment. Once an action is 
executed, the agent receives a reinforcement signal or reward. A negative reward 
punishes the agent for a bad action. This loop can help the algorithm to stabilize. 
The trial-and-error search helps to find better action, while a memory of good 
actions helps to keep good solutions such that a reward can be assigned. This is 
an exploration-exploitation tradeoff. 

In every step of interaction the agent receives feedback about the state of the
environment $s_{t+1}$ and the reward $r_{t+1}$ for its latest action $a_t$. The
agent chooses an action $a_{t+1}$ representing the output function, which
changes the state $s_{t+1}$ of the environment and thus leads to state
$s_{t+2}$. The agent receives new feedback from the reinforcement signal
$r_{t+2}$. The reinforcement signal $r$ is often delayed since it is a result
of network outputs in the past. This is solved by learning a critic network
which represents a cost function $J$ predicting future reinforcement.

A reinforcement learning agent has several components. The goal of the agent
is to maximize the total reward it receives over the whole learning process. A
policy is a decision-making function that specifies the next action from the
current situation, that is, a mapping from state to action. A reward function
defines what are good and bad actions. Design of a reward function is of
critical importance. A value function predicts future reward. The agent’s job
is to find a policy $\pi$, mapping states to actions, that maximizes some
long-run measure of reinforcement. The quality of this policy is quantified by
the so-called value function. In order to find an optimal policy, it is
necessary to find an optimal value function. Multiobjective reinforcement
learning problems have two or more objectives to be achieved by the agent, each
with its own associated reward signal. Such problems lead to multiple policies
rather than a single policy.

Reinforcement learning can be subdivided into two fundamental problems: 
learning and planning. Learning is for an agent to improve its policy from inter- 
actions with its environment, and planning is for an agent to improve its policy 
without further interaction with its environment. 

Reward shaping provides an additional reward that does not come from the 
environment. The additional reward is extra information that is incorporated by 
the system designer and estimated on the basis of knowledge of the problem. A 
number of algorithms are available [10]. 

In discrete reinforcement learning, the numbers of states and actions are
finite and countable, and the values of states (or state-action pairs) are
saved in a value table whose elements are adjusted independently [26]. In
contrast, continuous reinforcement learning has infinite numbers of states and
actions, and function approximators are used to approximate the value function.
Changing an approximator parameter may change the approximate values over the
entire space. Because of these differences, the methods used in discrete
reinforcement learning for managing the balance between exploration and
exploitation cannot be used directly, or do not improve the balance, in the
continuous case.

A fundamental problem in reinforcement learning is the exploration-exploitation
problem. A suitable strategy is to have higher exploration and lower
exploitation at the early stage of learning, and then to decrease exploration
and increase exploitation gradually. Before exploiting, however, adequate
exploration and an accurate estimate of the action value function should have
been achieved. Although longer exploration leads to a more accurate action
value function in discrete reinforcement learning, this may not be the case in
continuous reinforcement learning, where the approximation accuracy of the
action value function depends on the distribution of the data.

The scalability of reinforcement learning to high-dimensional continuous
state-action systems remains problematic because of the curse of
dimensionality. Hierarchical reinforcement learning scales to problems with
large state spaces by using the task (or action) structure to restrict the
space of policies [7]. Free-energy-based reinforcement learning [17] is capable
of handling high-dimensional inputs and actions. The method facilitates the
approximation of action values in large state and action spaces using the
negative free energy of the network state. It is implemented by the restricted
Boltzmann machine and is able to solve a temporal credit assignment problem in
Markov decision processes with large state and action spaces.

The relative payoff procedure [8] is a static reinforcement learning algorithm 
whose foundation is not stochastic gradient ascent. Under certain circumstances 
applying the relative payoff procedure is guaranteed to increase the mean return, 
even though it can make large changes in the values of the parameters [5]. The 
idea of transfer learning has been applied to reinforcement learning tasks [23]. 

Inverse reinforcement learning is the problem of recovering the underlying 
reward function from the behavior of an expert. This is a natural way to exam- 
ine animal and human behaviors. Most of the existing inverse reinforcement 
learning algorithms assume that the environment is modeled as a Markov deci- 
sion process. For more realistic partially observable environments that can be 
modeled as a partially observable Markov decision process, inverse reinforce- 
ment learning poses a greater challenge since it is ill-posed and computationally 
intractable. These obstacles are overcome in [3]. The expert’s behavior can be
given either as an explicit policy or as a set of trajectories.


Actor-critic model 


A reinforcement learning agent aims to find the policy $\pi$ which maximizes the
expected value of a certain function of the immediate rewards received while
following the policy $\pi$. Optimality for both the state value function
$V_\pi$ and the state-action value function $Q_\pi$ is governed by the Bellman
optimality equation.
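
For reference, the Bellman optimality equations can be written as below for the
discounted case. The notation is an assumption made here, since the text does
not spell it out: $p(s' \mid s, a)$ denotes the transition probability,
$r(s, a, s')$ the immediate reward, and $\gamma$ the discount factor.

```latex
V^{*}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)
           \bigl[ r(s, a, s') + \gamma V^{*}(s') \bigr],
\qquad
Q^{*}(s, a) = \sum_{s'} p(s' \mid s, a)
           \bigl[ r(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \bigr].
```

The two forms are linked by $V^{*}(s) = \max_{a} Q^{*}(s, a)$.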

The learning agent can be split into two separate entities: the actor (policy) 
and the critic (value function). As such, reinforcement learning algorithms can 
be divided into three groups [1]: actor-only, critic-only, and actor-critic methods. 
Actor-only methods typically work with a parameterized policy over which
optimization can be used directly. The optimization methods used suffer from
high variance in the estimates of the gradient. Owing to their gradient-descent
nature, actor-only methods have strong convergence properties but learn slowly.
Compared to critic-only methods, actor-only methods allow the policy to
generate actions in the complete continuous action space.

Critic-only methods have a lower variance in the estimates of expected returns 
[20]. A policy can be derived by selecting greedy actions [22]. Therefore, critic- 
only methods usually discretize the continuous action space, after which the 
optimization over the action space becomes a matter of enumeration. Critic-only 
methods, such as Q-learning [27], [28] and SARSA [22], use a state-action value 
function and no explicit function for the policy. For continuous state and action 
spaces, this will be an approximate state-action value function. In Q-learning,
the system estimates an action value function Q(s, a) for all state-action
pairs, and selects the control action based on Q(s, a).

Figure 18.2 Illustration of an actor-critic network. The actor is responsible for generating a control
input u, given the current state s. The critic is responsible for updating the actor and itself.


Actor-critic methods combine the merits of both the actor-only and critic- 
only methods. In actor-critic methods [1], the critic evaluates the quality of the 
current policy prescribed by the actor. The actor is a separate memory structure 
to explicitly represent the control policy. The critic approximates and updates 
the value function using the rewards it receives. The state value function V(s) is
then used to update the actor’s policy parameters for the best control actions
in the direction of performance improvement. Policy-gradient-based actor-critic
algorithms are popular for being able to search for optimal policies using low- 
variance gradient estimates. 

Figure 18.2 gives the structure of an actor-critic network. The basic building 
blocks are an actor which uses a stochastic method to determine the correct 
relation between the input and the output, and an adaptive critic which learns 
to give a correct prediction of future reward or punishment [1]. The external 
reinforcement signal r can be generated by a special sensor or be derived from 
the state vector. 

The binary external reinforcement provides very limited information for the 
learning algorithm. An additional adaptive critic network [29] is usually used 
to predict the future reinforcement signal, called internal reinforcement. This
helps the agent to avoid bad states. In most formulations, reinforcement
learning is achieved by using reward-prediction errors, i.e., the difference
between an agent’s current reward prediction and the actual reward obtained, 
to update the agent’s reward predictions. As the reward predictions are learned, 
the predictions can also be used by an agent to select its next action. 


18.4 Model-free and model-based reinforcement learning


Agents may use value functions or direct policy search, be model-based or model- 
free. Model-free approaches obtain an optimal decision-making policy by directly 
mapping environmental states to actions. They learn to predict the utility of each 
action in different situations but they do not learn the effects of actions. Model- 
based methods attempt to construct a model of the environment, typically in 
the form of a Markov-decision process, followed by a selection of optimal actions 
based on that model. As a consequence, model-based methods often make better
use of a limited amount of experience and thus achieve a better policy with 
fewer environmental interactions. Model-free methods are simpler, and require 
less computational resources. 

Policy iteration and policy search are two popular formulations of model-free 
reinforcement learning. In the policy iteration approach [1], the value function 
is first estimated by solving a set of linear equations and then policies are deter- 
mined and updated based on the learned value function. The value function of 
a policy is just the expected infinite discounted reward that will be gained, at 
each state, by executing that policy. The optimal value function can be deter- 
mined by value iteration. Least-squares policy iteration [11] is an approximate 
policy iteration algorithm, providing good properties in convergence, stability, 
and sample complexity. 
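
The two alternating steps of policy iteration can be sketched in Python as
follows. For clarity the sketch assumes that the transition model is known, so
that the value function of the current policy can be obtained exactly from a
linear system; in a model-free setting this evaluation step would be replaced
by an estimate from samples. The two-state model `P` and all function names are
made up for illustration.

```python
# Policy iteration on a tiny, made-up MDP (all numbers are illustrative).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9
STATES = sorted(P)


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(policy):
    """Policy evaluation: solve (I - gamma * P_pi) V = R_pi for V."""
    A, b = [], []
    for s in STATES:
        row = [1.0 if s2 == s else 0.0 for s2 in STATES]
        rhs = 0.0
        for p, s2, r in P[s][policy[s]]:
            row[STATES.index(s2)] -= GAMMA * p
            rhs += p * r
        A.append(row)
        b.append(rhs)
    return dict(zip(STATES, solve(A, b)))


def improve(V):
    """Policy improvement: act greedily with respect to V."""
    def q(s, a):
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in STATES}


policy = {s: 0 for s in STATES}   # arbitrary initial policy
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:      # stop when the policy no longer changes
        break
    policy = new_policy
print(policy, V)
```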

Actor-critic algorithms follow policy iteration in spirit, and differ in many ways 
from the regular policy iteration algorithm that is based on the Bellman equation. 
They start with a stochastic policy in the simulator, where actions are selected 
in a probabilistic manner. Many actor-critic algorithms approximate the policy 
and the value function using neural networks. In policy iteration, the improved 
policy is greedy in the value function over the action variables. In contrast, 
actor-critic methods employ gradient rules to update the policy in a direction 
that increases the received returns. The gradient estimate is constructed using 
the value function. 
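
A minimal tabular sketch of such a gradient-based actor-critic update is given
below in Python. It assumes a softmax actor over action preferences and a
critic trained by one-step temporal differences; the two-step task in `step`,
the step sizes and all names are hypothetical and serve only to make the
example runnable.

```python
import math
import random

GAMMA, ETA_CRITIC, ETA_ACTOR = 0.9, 0.1, 0.1   # illustrative constants
TERMINAL = 2


def step(s, a):
    """Hypothetical two-step task: returns (next_state, reward)."""
    if s == 0:
        return (1, 0.0) if a == 1 else (TERMINAL, 0.2)
    return (TERMINAL, 1.0) if a == 1 else (TERMINAL, 0.0)


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes=5000, seed=0):
    rng = random.Random(seed)
    V = [0.0, 0.0, 0.0]                 # critic: state values (terminal stays 0)
    theta = [[0.0, 0.0], [0.0, 0.0]]    # actor: action preferences per state
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1   # stochastic policy
            s2, r = step(s, a)
            delta = r + GAMMA * V[s2] - V[s]          # TD error from the critic
            V[s] += ETA_CRITIC * delta                # critic update
            for b in (0, 1):                          # actor: log-softmax gradient
                grad = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += ETA_ACTOR * delta * grad
            s = s2
    return V, [softmax(t) for t in theta]


V, policy = train()
print(policy)
```

The policy is never made greedy in one jump; each preference moves a small step
in the direction indicated by the critic's error signal.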

The adaptive heuristic critic algorithm [10] is an adaptive version of policy
iteration in which the value function is computed by the TD(0) algorithm [20].
The method consists of a critic for learning the value function $V_\pi$ for a
policy $\pi$ and a reinforcement-learning component for learning a new policy
$\pi'$ that maximizes the new value function. The work of the two components
can be accomplished in a unified manner by the Q-learning algorithm. The two
components operate in an alternating or a simultaneous manner. It can be hard
to select the relative learning rates so that the two components converge
together. The alternating implementation is guaranteed to converge to the
optimal policy, under appropriate conditions [10].

Although policy iteration can naturally deal with continuous states by function 
approximation, continuous actions are hard to handle due to the difficulty of 
finding maximizers of value functions with respect to actions. Control policies 
can vary drastically in each iteration, causing severe instability in a physical 
system. In the policy search approach, control policies are directly learned so 
that the return is maximized. 

The standard approach to reinforcement learning specifies feedback in the form of
real-valued rewards. Preference-based policy search uses a qualitative preference 
signal as feedback for driving the policy learner towards better policies [6]. Other 
forms of feedback, most notably external advice, can also be incorporated. 

Approximate policy evaluation is a difficult problem, because it involves find- 
ing an approximate solution to a Bellman equation. An explicit representation of 
the policy can be avoided by computing improved actions on demand from the
current value function. Alternatively, the policy can be represented explicitly by 
policy approximation. 

In model-based methods, an agent uses its experience to learn an internal 
model as to how the actions affect the agent and its environment. Such a model 
can be used in conjunction with dynamic programming to perform off-line plan- 
ning, often achieving better performance with fewer environmental samples than 
model-free methods do. The agent simultaneously uses experience to build a 
model and to adjust the policy, and uses the model to adjust the policy. Most 
of the model-building algorithms are based on the idea of computing the value 
function rather than computing Q values. However, since the updating within 
a simulator is asynchronous and step-size-based, Q-value versions are perhaps 
more appropriate. Model-building algorithms require more storage space in com- 
parison to their model-free counterparts. 

The Dyna architecture [21], [22] is a hybrid model-based and model-free rein- 
forcement learning algorithm, in which interactions with the environment are 
used both for a direct policy update with a model-free reinforcement learning 
algorithm, and for an update of an environmental model. Dyna applies
temporal-difference learning both to real experience and to simulated
experience. Dyna-2 combines temporal-difference learning with
temporal-difference search, using long and short-term memories. The long-term
memory is updated from real experience, and the short-term memory is updated
from simulated experience, both using the TD($\lambda$) algorithm. The
Dyna-style system proposed in [9] utilizes
a temporal difference method for direct learning and relative values for plan- 
ning between two successive direct learning cycles. A simple predictor of average 
rewards is introduced to the actor-critic architecture in the simulation (plan- 
ning) mode. The accumulated difference between the immediate reward and the 
average reward is used to steer the process in the right direction. 
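
The basic Dyna idea, learning from real transitions while replaying transitions
from a learned model, can be sketched in Python as below. This is a simplified
tabular variant with a one-step Q-learning update, not the system of [9]; the
corridor environment in `step` and all constants are assumptions made for the
example.

```python
import random

N_STATES, GOAL = 5, 4          # corridor 0..4 with the goal at the right end
ACTIONS = (-1, +1)             # move left / move right
GAMMA, ETA, EPSILON, N_PLANNING = 0.95, 0.5, 0.1, 10   # illustrative values


def step(s, a):
    """Hypothetical deterministic environment: reward 1 on reaching GOAL."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0)


def dyna_q(episodes=50, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}                                   # learned model: (s, a) -> (s2, r)

    def td_update(s, a, r, s2):
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ETA * (r + GAMMA * best_next - Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:                                # greedy, ties broken at random
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r = step(s, a)
            td_update(s, a, r, s2)               # learning from real experience
            model[(s, a)] = (s2, r)              # model update
            for _ in range(N_PLANNING):          # planning from simulated experience
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                td_update(ps, pa, pr, ps2)
            s = s2
    return Q


Q = dyna_q()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

Each real step is followed by `N_PLANNING` replayed steps, so the value
estimates improve with far fewer interactions than the real steps alone would
allow.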


Temporal-difference learning 


When the agent receives a reward (or penalty), a major problem is how to 
distribute the reinforcement among the decisions that led to it; this is known 
as the temporal credit assignment problem. Temporal-difference learning is a 
particularly effective model-free reinforcement learning method of solving this 
problem. 

Temporal-difference methods [20] are a class of incremental learning proce- 
dures specialized for prediction problems, that is, for using past experience with 
an incompletely known system to predict its future behavior. They can be viewed 
as gradient descent in the space of the parameters by minimizing an overall error 
measure. The steps in a sequence should be evaluated and adjusted according 
to their immediate or near-immediate successors, rather than according to the 
final outcome. 
Whereas conventional prediction-learning methods assign credit by means of
the difference between predicted and actual outcomes, temporal-difference meth- 
ods assign credit by means of the difference between temporally successive pre- 
dictions, and learning occurs whenever there is a change in prediction over time. 
In temporal-difference learning, reward estimates at successive times are com- 
pared. By comparing reward estimates rather than waiting for a reward from the 
environment, a temporal-difference learning system is effective for solving tasks 
where the reward is sparse. Temporal-difference methods require little memory
and peak computation, yet produce accurate predictions [20]. The unique feature
of temporal-difference learning is its use of bootstrapping, where predictions
are used as targets during the course of learning. 

Actor-critic methods [1] are a special case of temporal-difference methods. 
Temporal-difference error depends also on the reward signal obtained from the 
environment as a result of the control action. A spiking neural network model 
for implementing actor-critic temporal-difference learning that combines local 
plasticity rules with a global reward signal is given in [15]. The synaptic plasticity 
underlying the learning process relies on biologically plausible measures of pre- 
and postsynaptic activity and a global reward signal. 

We define $(s, a, r, s')$ to be an experience tuple summarizing a single
transition in the environment, where $s$ and $s'$ are the states of the agent
before and after the transition, $a$ is its choice of action, and $r$ the
instantaneous reward it receives. The value of a policy is learned using the
TD(0) algorithm [20]

$$V(s) \leftarrow V(s) + \eta \left[ r + \gamma V(s') - V(s) \right], \tag{18.1}$$

where $\eta$ is the learning rate and $\gamma$ is the discount factor. Whenever
a state $s$ is visited, its estimated value is updated to be closer to
$r + \gamma V(s')$. If the learning rate $\eta$ is adjusted properly and the
policy is held fixed, TD(0) is guaranteed to converge to the value function of
that policy.
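
The TD(0) rule can be tried out on a small prediction problem. The Python
sketch below evaluates a fixed random policy on a five-state random walk, an
example chosen here for illustration and not taken from the text; `ETA` is the
learning rate and `GAMMA` the discount factor, and the true value of state $i$
in this task is $i/6$.

```python
import random

N, GAMMA, ETA = 5, 1.0, 0.02   # illustrative: 5 non-terminal states, no discounting


def td0(episodes=5000, seed=0):
    """TD(0) evaluation of a fixed random policy on a hypothetical random walk.

    States 1..N are non-terminal; 0 and N+1 are terminal. Each step moves left
    or right with equal probability; only reaching N+1 pays a reward of 1.
    """
    rng = random.Random(seed)
    V = [0.0] * (N + 2)                 # terminal entries stay at 0
    for _ in range(episodes):
        s = (N + 1) // 2                # start in the middle
        while 0 < s < N + 1:
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == N + 1 else 0.0
            V[s] += ETA * (r + GAMMA * V[s2] - V[s])   # update (18.1)
            s = s2
    return V[1:N + 1]


values = td0()
print([round(v, 2) for v in values])
```

With a constant learning rate the estimates keep fluctuating slightly around
the true values; a decaying rate would be needed for exact convergence.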

The TD(0) rule is an instance of a class of algorithms called TD($\lambda$),
$\lambda \in [0,1]$. TD(0) looks only one step ahead when adjusting value
estimates, and the convergence is very slow. At the other extreme, TD(1)
updates the value of a state from the final return; it is equivalent to
Monte-Carlo evaluation. The TD($\lambda$) rule is applied to every state $u$
according to its eligibility $e(u)$, rather than just to the immediately
previous state $s$ [10]:

$$V(u) \leftarrow V(u) + \eta \left[ r + \gamma V(s') - V(s) \right] e(u). \tag{18.2}$$

The eligibility trace can be defined by

$$e(u) = \sum_{k=1}^{t} (\lambda\gamma)^{t-k} \delta_{u, s_k}, \tag{18.3}$$

where $\delta_{u, s_k} = 1$ if $u = s_k$ or 0 otherwise, and $\gamma$ is the
discount factor. The eligibility can be updated online by

$$e(u) \leftarrow \begin{cases} \gamma\lambda e(u) + 1, & \text{if } u = \text{current state}, \\ \gamma\lambda e(u), & \text{otherwise}. \end{cases} \tag{18.4}$$

The eligibility trace represents the total credit assigned to a state for any
subsequent errors in evaluation. If all states are visited infinitely many
times, and with appropriate step sizes, TD($\lambda$) converges to the value of
the policy $V_\pi$ for any $\lambda$ [4]. TD($\lambda$) often converges
considerably faster for large $\lambda$, but is computationally more expensive.
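
The online use of eligibility traces can be sketched in Python as follows,
again on a five-state random walk chosen here purely for illustration. The
trace of every state is decayed by $\gamma\lambda$ at each step, the trace of
the current state is incremented by one, and every state is then updated in
proportion to its trace; the constants are assumptions.

```python
import random

N, GAMMA, LAMBDA, ETA = 5, 1.0, 0.8, 0.02   # illustrative constants


def td_lambda(episodes=5000, seed=0):
    """TD(lambda) with accumulating traces on a hypothetical random walk.

    States 1..N are non-terminal; 0 and N+1 are terminal, and only reaching
    N+1 pays a reward of 1. The true value of state i is i / (N + 1).
    """
    rng = random.Random(seed)
    V = [0.0] * (N + 2)
    for _ in range(episodes):
        e = [0.0] * (N + 2)             # eligibility traces, reset per episode
        s = (N + 1) // 2
        while 0 < s < N + 1:
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == N + 1 else 0.0
            delta = r + GAMMA * V[s2] - V[s]
            for u in range(1, N + 1):   # decay all traces, as in (18.4)
                e[u] *= GAMMA * LAMBDA
            e[s] += 1.0                 # the current state gains eligibility
            for u in range(1, N + 1):   # update every state, as in (18.2)
                V[u] += ETA * delta * e[u]
            s = s2
    return V[1:N + 1]


values = td_lambda()
print([round(v, 2) for v in values])
```

The extra loop over all states at every step is the additional computational
cost that comes with a nonzero $\lambda$.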

The kNN-TD($\lambda$) algorithms [12] are a series of general-purpose
reinforcement learning algorithms for linear function approximation, which are
based on temporal-difference learning and weighted k-NN. These algorithms are
able to learn quickly, to generalize properly over continuous state spaces and
also to be robust to a high degree of environmental noise. A derivation of
kNN-TD($\lambda$) is described for problems where the use of continuous actions
has clear advantages over the use of fine-grained discrete actions.

Temporal-difference learning learns the action value function by using the
update rule [20]

$$Q_{t+1}(s, a) = Q_t(s, a) + \eta \left( x_t - Q_t(s, a) \right), \tag{18.5}$$

where the Q values represent the possible reward received in the next step for
taking action $a$ in state $s$, plus the discounted future reward received from
the next state-action observation, $\eta$ is the learning rate, and

$$x_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a'), \quad \text{(Q-learning [28])}, \tag{18.6}$$

$$x_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}), \quad \text{(SARSA [22])}, \tag{18.7}$$

where $r_{t+1}$ is the immediate reward, $\gamma$ is the discount factor, $s$
and $a$ correspond to the current state and action, and $s_{t+1}$ and $a_{t+1}$
denote the future state and action. Dynamic programming allows one to select an
optimal action for the next state on a decision level, as long as all actions
have been evaluated until time $t$.

When interacting with the environment, a SARSA agent updates the pol- 
icy based on actions taken, while Q-learning updates the policy based on the 
maximum reward of available actions. Q-learning learns Q values with better 
exploitation policy compared with SARSA. SARSA employs on-policy update, 
while Q-learning uses off-policy update. SARSA is also called approximate pol- 
icy iteration, and Q-learning is the off-policy variant of the policy iteration algo- 
rithm. 
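
The difference between the on-policy and the off-policy update lies only in the
target that is bootstrapped from the next state. A small Python sketch with a
made-up Q table shows the two targets side by side; the table entries, the
state name `"s1"` and the action names are hypothetical.

```python
GAMMA = 0.9   # illustrative discount factor

# Hypothetical Q table for one next state with three actions.
Q = {("s1", "left"): 1.0, ("s1", "stay"): 4.0, ("s1", "right"): 2.0}
ACTIONS = ("left", "stay", "right")


def q_learning_target(r, s_next):
    """Off-policy target: bootstrap from the best next action."""
    return r + GAMMA * max(Q[(s_next, a)] for a in ACTIONS)


def sarsa_target(r, s_next, a_next):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + GAMMA * Q[(s_next, a_next)]


# Suppose the behaviour policy explored and picked "right" in s1.
print(q_learning_target(1.0, "s1"))        # 1 + 0.9 * 4
print(sarsa_target(1.0, "s1", "right"))    # 1 + 0.9 * 2
```

Whenever the next action is exploratory rather than greedy, the SARSA target is
lower than the Q-learning target, so SARSA's estimates reflect the cost of
exploration while Q-learning's do not.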

Using an action selection strategy of the exploratory nature may considerably 
reduce the run time of the algorithm. An exploratory strategy selects the action 
that has the highest Q value with a high probability and selects the other actions 
with a low non-zero probability. When such an exploratory strategy is pursued, 
the simulated learning agent will select the non-greedy (exploratory) action with 
a probability diminishing with iterations; the actual action will finally be the 
greedy action, which is the action prescribed by the policy learned by the algo- 
rithm. 
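
One way to realize such an exploratory strategy is sketched below in Python.
The decay schedule `epsilon_at` and its parameters are hypothetical; any
schedule that lets the exploration probability diminish with the iterations
would serve the same purpose.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick the greedy action with high probability, another one otherwise."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if rng.random() < epsilon:
        others = [a for a in range(len(q_values)) if a != greedy]
        return rng.choice(others)
    return greedy


def epsilon_at(iteration, eps0=0.5, decay=0.01):
    """Hypothetical schedule: exploration probability shrinks with iterations."""
    return eps0 / (1.0 + decay * iteration)


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]
early = [epsilon_greedy(q_values, epsilon_at(0), rng) for _ in range(1000)]
late = [epsilon_greedy(q_values, epsilon_at(10_000), rng) for _ in range(1000)]
print(early.count(1) / 1000, late.count(1) / 1000)
```

Early in the run the greedy action (index 1 here) is chosen only about half of
the time; late in the run it is chosen almost always.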


Q-learning


In Q-learning [27] the approximation to the optimal action value function takes 
place independently of the evaluation policy by using only the path with the 
greatest action value to calculate a one-periodic difference. Q-learning is the 
most widely used reinforcement learning algorithm for addressing the control 
problem because of its off-policy update, which makes the convergence control 
easier. 

In Q-learning, the objective is to learn the expected discounted reinforcement
values, $Q^*(s, a)$, of taking action $a$ in state $s$ and choosing actions
optimally thereafter. Assuming the best action is taken initially, we have the
value of $s$ as $V^*(s) = \max_a Q^*(s, a)$ and the optimal policy
$\pi^*(s) = \arg\max_a Q^*(s, a)$, which chooses an action just by taking the
one with the maximum Q value for the current state. The Q values can be
estimated online using a method essentially the same as TD(0), and are used to
define the policy.

The Q-learning rule is given by

$$Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma \max_{a' \in A} Q(s_{t+1}, a') - Q(s, a) \right), \tag{18.8}$$

where $r$ is the immediate reward, $A$ is the action space, $s$ is the current
state, and $s_{t+1}$ is the future state. The next action is the one with the
highest Q value.

If each action is executed in each state an infinite number of times on an
infinite run and $\eta$ is decayed appropriately, the Q values will converge
with probability 1 to the optimal values $Q^*$ [27], independent of how the
agent behaves while the data is being collected. For these reasons, Q-learning
is a popular and effective model-free algorithm for learning from delayed
reinforcement. However, it does not address any of the issues involved in
generalizing over large state and/or action spaces. In addition, it may
converge quite slowly to a good policy.

Q-learning can also be extended to update states that occurred more than one
step previously, as in TD($\lambda$) [14]. When the Q values nearly converge to their
optimal values, it is appropriate for the agent to act greedily, taking the action 
with the highest Q value in each situation. However, it is difficult to make an 
exploitation-exploration tradeoff during learning. 


Example 18.1: A Java applet of Q-learning
(http://thierry.masson.free.fr/IA/en/qlearning_applet.htm, written by Thierry
Masson) allows the user to construct a grid with danger (red), neutral and
target (green) cells, and to modify various learning parameters.

An agent (or robot) has to learn how to move on a grid map: it learns to
avoid dangerous cells and to reach target cells as quickly as possible. The agent
is able to move from cell to cell, using one of the four directions (north, east,
south and west). It is also able to determine the nature of the cell it currently
occupies. The grid is a square closed domain: the agent cannot escape, but
may hit the domain bounds.


Figure 18.3 Learning result of Q-learning: The learned policy.

As long as it explores the grid, the agent receives a reward for each move it
makes. It receives a reward of 0.0 if entering a neutral cell, a penalty of $-5000.0$
if entering a dangerous cell, and a reward of $+1000.0$ if entering a target cell.
The maximum number of iterations is set to 1000000, the exploration probability
is set to $\epsilon = 0.8$, the learning rate is $\eta = 0.05$, and the discount factor is
$\gamma = 0.9$. The exploration strategy chooses a random action with probability
$\epsilon$, and leaves the original action unchanged with probability $1 - \epsilon$. The
route is set to be digressive. These parameters can be adjusted.

At the beginning of the learning process, or anytime the agent has hit a target 
cell, a new exploration process begins, starting from a randomly chosen position 
on the grid. And then, the agent restarts to explore the domain. After running 
the Java applet on a PC with Core2 Duo CPU at 2.10 GHz and 1 GB memory, 
the target has been hit 119934 times, and the approximate learning time is 1.828 
s. Figure 18.3 gives the learned policy which is represented as arrows overlaying 
the grid. The policy denotes which direction the agent should move in from each 
square of the grid. 
Learning automata


Learning automata, a branch of the theory of adaptive control, were originally 
described explicitly as finite state automata [25]. A learning automaton is an 
agent situated in a random environment that learns the optimal action through 
repeated interactions with its environment. The actions are chosen according to 
a specific probability distribution, which is updated based on the environment 
response the agent obtains by performing a particular action. The learning automata
approach is a reinforcement learning method that directly manipulates the policy.

The linear reinforcement scheme is a simple example algorithm. Let $p_i$ be the
agent’s probability of taking action $i$. When action $a_i$ succeeds, it gets rewarded,
and its probability is increased while the probabilities of all other actions are
decreased:


$$p_i(t+1) = p_i(t) + \alpha (1 - p_i(t)),$$
$$p_j(t+1) = (1 - \alpha) p_j(t), \quad \text{for } j \neq i, \tag{18.9}$$


where $\alpha \in (0,1)$ is the reward parameter. Similarly, when action $a_i$ fails, the
probabilities are defined by


$$p_i(t+1) = (1 - \beta) p_i(t),$$
$$p_j(t+1) = \frac{\beta}{n_a - 1} + (1 - \beta) p_j(t), \quad \text{for } j \neq i, \tag{18.10}$$


where $\beta \in (0,1)$ is the penalty parameter, and $n_a$ is the number of actions.
When $\beta = \alpha$, we get the linear reward-penalty ($L_{R-P}$) algorithm. When $\beta = 0$,
$p_i$ remains unchanged for all $i$ after a failure, and we get the linear reward-inaction
($L_{R-I}$) algorithm [10]. In the $L_{R-I}$ scheme, the action probabilities are updated in the
case of a reward response from the environment, but no penalties are assessed.
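As a hedged illustration of (18.9) and (18.10), the following Python sketch applies one update of the linear scheme to a probability vector. The function name `linear_update` and the default parameter values are assumptions made for the example; setting `beta=0` gives the reward-inaction variant.

```python
def linear_update(p, i, success, alpha=0.1, beta=0.1):
    """Return the updated action probabilities after action i was tried.

    p       : list of action probabilities summing to 1
    success : True applies the reward rule (18.9), False the penalty rule (18.10)
    """
    n_a = len(p)
    if success:
        # Reward: move p_i toward 1 and shrink all other probabilities.
        return [pk + alpha * (1.0 - pk) if k == i else (1.0 - alpha) * pk
                for k, pk in enumerate(p)]
    # Penalty: shrink p_i and spread the removed mass over the other actions.
    return [(1.0 - beta) * pk if k == i else beta / (n_a - 1) + (1.0 - beta) * pk
            for k, pk in enumerate(p)]

# Hypothetical usage with four equally likely actions.
rewarded = linear_update([0.25] * 4, i=0, success=True)
penalized = linear_update([0.25] * 4, i=0, success=False)
```

Both rules keep the probabilities normalized: the mass added to (or removed from) action $i$ is exactly what is removed from (or added to) the other actions.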





Linear reward-inaction ($L_{R-I}$) converges with probability 1 to a probability
vector containing a single 1 and the rest 0’s, that is, choosing a particular action
with probability 1. Unfortunately, it does not always converge to the correct
action; but the probability that it converges to the wrong one can be made
arbitrarily small by making $\alpha$ small [13]. Linear reward-inaction ($L_{R-I}$) is proven
to be $\epsilon$-optimal [13].

A single automaton is generally sufficient for learning the optimal value of 
one parameter. However, for multi-dimensional optimization problems, a system 
consisting of as many automata as the number of parameters is needed. Such a 
system of automata can be a game of automata. 

Let $A_1, \ldots, A_N$ be the automata involved in an $N$-player game. Each play
of the game consists of all automaton players choosing their actions and then
getting the payoffs (or reinforcements) from the environment for the actions. Let
$p_1(k), \ldots, p_N(k)$ be the action probability distributions of the $N$ automata at
the $k$th play. Each automaton $A_i$ chooses an action $a^i(k)$ independently and at
random according to $p_i(k)$, $1 \le i \le N$. Essentially, each player wants to maximize
its payoff $r^i(k)$. Since there are multiple payoff functions, learning can be
targeted to reach a Nash equilibrium. It is known that if each of the automata
uses an $L_{R-I}$ algorithm for updating action probabilities, then the game would
converge to a Nash equilibrium [18]. Learning automata are thoroughly overviewed
in [13], [24].


18.1 Show how the idea of reinforcement learning is implemented in the LVQ2 
algorithm. 


18.2 The RProp algorithm employs the idea of reinforcement learning. Describe
how RProp implements this idea.


18.3 Reinforcement learning is very useful for guiding robots through obstacles. 
Describe the process. 
19 Probabilistic and Bayesian networks


19.1 Introduction


The Bayesian network model was introduced by Pearl in 1985 [149]. It is the 
best known family of graphical models in artificial intelligence (AI). Bayesian 
networks are a powerful tool of common knowledge representation and reasoning 
for partial beliefs under uncertainty. They are probabilistic models that combine 
probability theory and graph theory. The formalism is sometimes called a causal 
probabilistic network or a probabilistic belief network. It possesses the character- 
istic of being both a statistical and a knowledge-representation formalism. The 
formalism is model-based, as domain knowledge can be structured by exploiting 
causal and other relationships between domain variables. 

Bayesian inference is widely established as one of the principal foundations for
machine learning. A Bayesian network is essentially an expert system. It can be
used for modelling causal relationships, representing uncertain knowledge,
performing probabilistic inference, and answering probabilistic queries. Probabilistic
graphical models are particularly appealing due to their natural interpretation.

The Bayesian network has wide applications in bioinformatics and medicine, 
engineering, classification, data fusion and decision support systems. A well- 
known application is in medical diagnosis systems [44], which give the diagnosis 
given symptoms. Bayesian networks are well suited to human-computer intel- 
ligent interaction tasks because they are easily mapped onto a comprehensible 
graphical network representation. Some well-known applications at Microsoft are
technical support troubleshooters, such as the Office Assistant in Microsoft Office
(“Clippy”), which observes some user actions and gives helpful advice, the Microsoft
Windows help system, and automated fault diagnostics such as the printer
fault-diagnostic system and software debugging [21].

Special cases of Bayesian networks were independently invented by many different
communities [172], such as genetics (linkage analysis), speech recognition
(hidden Markov models), tracking (Kalman filtering), data compression (density
estimation) and coding (turbo codes). In hidden Markov models (HMMs), inference
is solved by the forward-backward algorithm, and the maximum a posteriori (MAP)
problem is handled by the Viterbi algorithm [153, 172]. The forward-backward and
Viterbi algorithms are directly equivalent to Pearl’s algorithms [149], which are
valid for any probability model that can be represented as a graphical model.
Kalman filters and related linear models for dynamical systems are similar to HMMs,
but the hidden state variables are real-valued rather than discrete.


Classical vs. Bayesian approach 


There are two main opposing schools of statistical reasoning, namely frequentist 
and Bayesian approaches. The frequentist or classical approach has dominated 
scientific research, but Bayesianism (Thomas Bayes, 1702-1761) is changing the 
situation. Whereas a classical probability of an event X is a true or physical 
probability, the Bayesian probability of an event X is a person’s degree of belief 
in that event, thus known as a personal probability. Unlike physical probability, 
the measurement of the personal probability does not need repeated trials. 

In the classical approach of learning from data, the parameter $\theta$ is fixed, and all
data sets of size $N$ are assumed to be generated by sampling from the distribution
determined by $\theta$. Each data set $D$ occurs with some probability $p(D|\theta)$ and will
produce an estimate $\theta^*(D)$. To evaluate an estimator, we obtain the expectation
and variance of the estimate:


$$E_{p(D|\theta)}(\theta^*) = \sum_D p(D|\theta) \, \theta^*(D), \tag{19.1}$$


$$\mathrm{var}_{p(D|\theta)}(\theta^*) = \sum_D p(D|\theta) \left[ \theta^*(D) - E_{p(D|\theta)}(\theta^*) \right]^2. \tag{19.2}$$


An estimator that balances the bias $\theta - E_{p(D|\theta)}(\theta^*)$ and the variance can be
chosen. A commonly-used estimator is the ML estimator, which selects the value
of $\theta$ that maximizes the likelihood $p(D|\theta)$.

In the Bayesian approach, $D$ is fixed, and all values of $\theta$ are assumed to be
possibly generated. The estimate of $\theta$ is the expectation of $\theta$ with respect to our
posterior beliefs about its value:


$$E_{p(\theta|D,\xi)}(\theta) = \int \theta \, p(\theta|D,\xi) \, d\theta, \tag{19.3}$$


where $\xi$ is the state of information (background knowledge).
The estimators given by (19.1) and (19.3) are defined differently and, in many cases,
lead to different estimates.
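The difference between the two viewpoints can be illustrated with a coin-tossing sketch in Python. The setting is an assumption made for this example and is not taken from the text: the data are $n$ Bernoulli trials with $k$ successes, the classical estimator is the ML estimate $k/n$, and the Bayesian estimate is the posterior mean under a Beta$(a, b)$ prior, which is $(k + a)/(n + a + b)$. The sum over data sets in (19.1) and (19.2) is grouped by the number of successes $k$.

```python
from math import comb

def ml_estimate(k, n):
    """ML estimate of the success probability from k successes in n trials."""
    return k / n

def sampling_mean_and_variance(theta, n):
    """Mean and variance of the ML estimate over data sets of size n, cf. (19.1)-(19.2).

    Data sets with the same number of successes k give the same estimate,
    so the sum over data sets is grouped into a binomial sum over k.
    """
    probs = [comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1)]
    mean = sum(p * ml_estimate(k, n) for k, p in enumerate(probs))
    var = sum(p * (ml_estimate(k, n) - mean)**2 for k, p in enumerate(probs))
    return mean, var

def posterior_mean(k, n, a=1.0, b=1.0):
    """Posterior mean of theta under an assumed Beta(a, b) prior, cf. (19.3)."""
    return (k + a) / (n + a + b)

mean, var = sampling_mean_and_variance(theta=0.3, n=10)
classical = ml_estimate(7, 10)     # estimate from one observed data set
bayesian = posterior_mean(7, 10)   # uniform prior pulls the estimate toward 0.5
```

In this setting the ML estimator is unbiased, so the computed mean equals the true parameter, whereas the posterior mean for a single data set differs from the ML estimate whenever the prior carries weight.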


Bayes’ theorem 


Bayes’ theorem is the origin and foundation of the Bayesian approach. In this
chapter, we use P(-) to denote a probability. 


Definition 19.1 (Conditional probability). Let $A$ and $B$ be two events. The
conditional probability of $A$ conditioned on $B$ is defined by


$$P(A|B) = \frac{P(A, B)}{P(B)}, \tag{19.4}$$


where $P(A, B)$ is the joint probability of both $A$ and $B$ happening.


This leads to the chain rule: $P(A, B) = P(A|B)P(B)$.
Given the occurrence of evidence $E$ depending on a hypothesis $H$, i.e. $H \to E$,
the probability for both $E$ and $H$ to happen can be expressed as:


$$P(H, E) = P(H)P(E|H). \tag{19.5}$$
By symmetry,
$$P(H, E) = P(H)P(E|H) = P(E, H) = P(E)P(H|E), \tag{19.6}$$


from which we obtain Bayes’ theorem. Bayes’ theorem forms the basis of the 
probability distributions between the nodes of the Bayesian network. 


Theorem 19.1 (Bayes’ theorem). Given the occurrence of evidence $E$
depending on hypothesis $H$, there is a relation between the probabilities:


$$P(E|H) = \frac{P(E)P(H|E)}{P(H)}. \tag{19.7}$$

From a statistical point of view, $P(E|H)$ denotes the conditional probability
of (belief in) evidence $E$ caused by the hypothesis $H$. $P(H|E)$ is the a posteriori
belief in $H$ faced with evidence $E$; that is, the degree of belief that hypothesis $H$
has actually occurred when evidence $E$ is detected. $P(H)$ denotes the a priori belief
in hypothesis $H$. $P(E)$ is the prior probability of evidence $E$.

A conventional statistical classifier is the optimal, parametric Bayes classifier. 
Bayesian design assumes functional forms for the densities in the mixture model 
and estimates the parameters. 


Graphical models 


Graphical models are graphs in which nodes represent random variables. The 
arcs between nodes represent conditional dependence. Undirected graphical mod- 
els are called Markov networks or Markov random fields. Markov networks are 
popular with the physics and vision communities [67]. They are also used for 
spatial data mining. Examples of undirected probabilistic independent networks 
are Markov networks [67] and Boltzmann machines [2]. 


Definition 19.2 (Conditional independence). Two sets of nodes A and B 
are said to be conditionally independent, if all paths between the nodes in A 
and B are separated by a node in a third set C. More specifically, two random 
variables X and Y are conditionally independent given another random variable 
Z if 

P(X|Z) = P(X|Y, Z). (19.8) 
Figure 19.1 A simple Bayesian network.


Definition 19.3 (d-separation). Let $V$ be the set of nodes. Two variables $A$
and $B$ in a Bayesian network are d-separated by $X \subset V$ if all paths between $A$
and $B$ are blocked by $X$.


Directed graphical models without directed cycles are called Bayesian net- 
works; that is, Bayesian networks are directed acyclic graphs (DAGs). Indepen- 
dence defined for Bayesian networks has to take into account the directionality 
of the arcs. An arc from node A to node B can be regarded as A causes B. 

Dynamic Bayesian networks are directed graphical models of stochastic pro- 
cesses. They generalize hidden Markov models (HMMs) and linear dynamical 
systems by representing the hidden and observed state in terms of state vari- 
ables. 


Example 19.1: The Bayesian network $Z \to Y \to X$ is shown in Fig. 19.1. We
have $P(X|Y,Z) = P(X|Y)$, since $Y$ is the only parent of $X$ and $Z$ is not a
descendant of $X$ (that is, $X$ is conditionally independent of $Z$ given $Y$). Prove that
independence is symmetric, that is, $P(Z|Y, X) = P(Z|Y)$.

Proof. Since $P(X|Y, Z) = P(X|Y)$, we have


$$P(Z|X,Y) = \frac{P(X,Y|Z)P(Z)}{P(X,Y)} \quad \text{(Bayes’ rule)}$$
$$= \frac{P(Y|Z)P(X|Y,Z)P(Z)}{P(X|Y)P(Y)} \quad \text{(Chain rule)}$$
$$= \frac{P(Y|Z)P(X|Y)P(Z)}{P(X|Y)P(Y)} \quad \text{(By assumption)}$$
$$= \frac{P(Y|Z)P(Z)}{P(Y)} = P(Z|Y) \quad \text{(Bayes’ rule)}.$$


That completes the proof.
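The symmetry can also be checked numerically. The following Python sketch builds the joint distribution of a binary chain $Z \to Y \to X$ and compares $P(Z|X,Y)$ with $P(Z|Y)$ for every configuration. The CPT values are hypothetical numbers chosen only for illustration.

```python
from itertools import product

# Hypothetical CPTs for binary variables in the chain Z -> Y -> X.
p_z = {0: 0.6, 1: 0.4}
p_y_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p_y_given_z[z][y]
p_x_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p_x_given_y[y][x]

# Joint distribution from the factorization P(Z) P(Y|Z) P(X|Y).
joint = {(z, y, x): p_z[z] * p_y_given_z[z][y] * p_x_given_y[y][x]
         for z, y, x in product((0, 1), repeat=3)}

def prob(z=None, y=None, x=None):
    """Marginal probability of the given assignments (None means summed out)."""
    return sum(p for (zz, yy, xx), p in joint.items()
               if (z is None or zz == z)
               and (y is None or yy == y)
               and (x is None or xx == x))

# Largest difference between P(Z|X,Y) and P(Z|Y) over all configurations.
max_gap = max(abs(prob(z=z, y=y, x=x) / prob(y=y, x=x) - prob(z=z, y=y) / prob(y=y))
              for z, y, x in product((0, 1), repeat=3))
```

Up to floating-point rounding, `max_gap` is zero, in agreement with the proof.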


19.2 Bayesian network model


A Bayesian network encodes the joint probability distribution of a set of $v$ variables
in a problem domain, $V = \{X_1, \ldots, X_v\}$, as a DAG $G = (V, E)$. Each node
of this graph represents a random variable $X_i$ in $V$ and has a conditional probability
table (CPT). Arcs stand for conditional dependence relationships among
these nodes.




It is easy to identify the parent-child relationship or the probability dependency
between two nodes. The parents of $X_i$ are denoted by $pa(X_i)$; the children
of $X_i$ are denoted by $ch(X_i)$; and spouses of $X_i$ (other parents of $X_i$’s children)
are denoted by $spo(X_i)$. A CPT contains probabilities of the node being a specific
value given its parents, that is, a CPT specifies the conditional probabilities
$P(X_i | pa(X_i) = u)$, where $u$ is a configuration of the parents of $X_i$.


Definition 19.4 (Markov blanket). The Markov blanket for node $X_i$ is the set
of all parents of $X_i$, children of $X_i$, and spouses of $X_i$.


In Bayesian networks, any variable $X_i$ is conditionally independent of the variables
outside its Markov blanket given the Markov blanket, that is,
$P(X_i | X_{\neq i}) = P(X_i | \text{Markov blanket}(X_i))$, where $X_{\neq i}$ denotes all variables
other than $X_i$.
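Definition 19.4 translates directly into code. In the following Python sketch, the DAG is stored as a dictionary mapping each node to the set of its parents; this representation, the function name `markov_blanket`, and the five-node example graph are assumptions made for illustration.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node`: its parents, children, and spouses.

    parents : dict mapping every node of the DAG to the set of its parents
    """
    # Children are the nodes that list `node` among their parents.
    children = {c for c, ps in parents.items() if node in ps}
    # Spouses are the other parents of those children.
    spouses = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | spouses

# Hypothetical DAG with arcs A -> C, B -> C, C -> D, E -> D.
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C", "E"}, "E": set()}
blanket_of_c = markov_blanket(parents, "C")
```

Here the blanket of C contains its parents A and B, its child D, and its spouse E, which is a parent of D.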

The Bayesian network model is based on the Markov condition: every variable
is independent of its nondescendant nonparents given its parents. This leads to
a unique joint probability density:


$$p(X) = \prod_{i=1}^{v} p(X_i | pa(X_i)), \tag{19.9}$$


where each $X_i$ is associated with a conditional probability density $p(X_i|pa(X_i))$.
This holds as long as the network was designed such that $pa(X_i) \subseteq
\{X_{i+1}, \ldots, X_v\}$. Given $pa(X_i)$, $X_i$ is conditionally independent of all
$X_j \in \{X_{i+1}, \ldots, X_v\}$ with $X_j \notin pa(X_i)$.

A node can represent a discrete random variable that takes values from a finite
set, or a numeric or continuous variable that takes values from a set of continuous
numbers. Bayesian networks can thus be classified into discrete, continuous,
and mixed Bayesian networks.

When a Bayesian network is used in conjunction with statistical techniques, 
the graphical model has several advantages for data analysis. As the model 
encodes dependencies among all variables, it readily handles situations where 
some data entries are missing. It is an ideal representation for combining prior 
knowledge and data. 

There are some inherent limitations on Bayesian networks. Directly identifying
the Bayesian network structures from input data $D$ remains a challenge. While
a given network can be described in linear time, the process of network discovery
is an NP-hard task for practically all model selection criteria such as AIC, BIC
and marginal likelihood [35, 44]. The second
problem is concerned with the quality and extent of the prior beliefs used in 
Bayesian inference. A Bayesian network is useful only when this prior knowledge 
is reliable. Selecting a proper distribution model to describe the data has a 
notable effect on the quality of the resulting network. 

Multiply sectioned Bayesian networks [202] relax the single Bayesian network
paradigm. The framework allows a large domain to be modeled modularly and
the inference to be performed distributively, while maintaining the coherence.
Inference in a Bayesian network can be performed effectively using its junction
tree representation. The multiply sectioned Bayesian network framework is
an extension of these junction tree based inference methods, with the HUGIN
method [93] being the most relevant. Multiply sectioned Bayesian networks provide
a coherent and flexible formalism for representing uncertain knowledge in large
domains. Global consistency among subnets is achieved by communication.


Figure 19.2 A Bayesian network with two of its CPTs.


Example 19.2: Probabilistic queries with respect to a Bayesian network are interpreted
as queries with respect to the CPTs the network specifies. Figure 19.2
depicts a simple Bayesian network with two of its CPTs:

The first CPT: $P(X = T) = 0.3$, $P(X = F) = 0.7$.

The second CPT: $P(Y = T | X = T) = 0$, $P(Y = F | X = T) = 1$,
$P(Y = T | X = F) = 0.8$, $P(Y = F | X = F) = 0.2$.
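A simple query on these two CPTs can be sketched in Python. The sketch assumes that only the two nodes X and Y, with the arc from X to Y, are involved; the dictionary layout and variable names are choices made for the example. The joint distribution follows from the factorization $P(X, Y) = P(X)P(Y|X)$, and marginal and posterior probabilities are obtained by summation.

```python
# CPTs of Example 19.2; p_y_given_x[x][y] stores P(Y = y | X = x).
p_x = {"T": 0.3, "F": 0.7}
p_y_given_x = {"T": {"T": 0.0, "F": 1.0}, "F": {"T": 0.8, "F": 0.2}}

# Joint distribution P(X, Y) = P(X) P(Y | X).
joint = {(x, y): p_x[x] * p_y_given_x[x][y] for x in p_x for y in ("T", "F")}

# Marginal query P(Y = T), summing the joint over X.
p_y_true = sum(p for (x, y), p in joint.items() if y == "T")

# Posterior query P(X = F | Y = T) from the joint and the marginal.
p_x_false_given_y_true = joint[("F", "T")] / p_y_true
```

Since $P(Y = T | X = T) = 0$, observing $Y = T$ rules out $X = T$, so the posterior probability of $X = F$ is 1, while $P(Y = T) = 0.7 \times 0.8 = 0.56$.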


Example 19.3: Assume the long-term experience with a specific kind of
tumor is P(tumor) = 0.01 and P(no tumor) = 0.99. A tumor may cause a positive
testing result. In the Bayesian network representation, the (causal)
direction from tumor to positive is captured by P(positive|tumor). The
CPT for tumor $\to$ positive is P(positive|tumor) = 0.7, P(negative|tumor) =
0.3, P(positive|no tumor) = 0.2, P(negative|no tumor) = 0.8. We now solve for
P(tumor|positive).
Solution. We have


$$P(\text{positive}) = P(\text{positive}|\text{tumor}) P(\text{tumor}) + P(\text{positive}|\text{no tumor}) P(\text{no tumor}) = 0.7 \times 0.01 + 0.2 \times 0.99 = 0.205.$$
From Bayes’ theorem,


$$P(\text{tumor}|\text{positive}) = \frac{P(\text{positive}|\text{tumor}) P(\text{tumor})}{P(\text{positive})} = \frac{0.7 \times 0.01}{0.205} = 0.0341.$$


After a positive test the probability of a tumor increases from 0.01 to 0.0341.
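The calculation of Example 19.3 can be reproduced with a short Python sketch. The function name `posterior` and its argument names are assumptions made for the example; the numbers are those of the example.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return (P(H|E), P(E)) for a binary hypothesis H and observed evidence E.

    prior                : P(H)
    likelihood           : P(E | H)
    likelihood_given_not : P(E | not H)
    """
    # Total probability of the evidence.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    # Bayes' theorem.
    return likelihood * prior / evidence, evidence

p_tumor_given_positive, p_positive = posterior(prior=0.01,
                                               likelihood=0.7,
                                               likelihood_given_not=0.2)
```

The small posterior reflects the low prior: most positive results come from the much larger group without a tumor.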


Figure 19.3 The Asia network.


Probabilistic relational models [138] extend the standard attribute-based
Bayesian network representation to incorporate a much richer relational structure.
A probabilistic relational model, together with a particular database of
objects and relations, defines a probability distribution over the attributes of
the objects and the relations. A unified statistical framework for content and 
links [68] builds on probabilistic relational models. The standard approach for 
inference with a relational model is based on the generation of a propositional 
instance of the model in the form of a Bayesian network, and then applying clas- 
sical algorithms, such as jointree [93], to compute answers to queries. A relational 
model describes a situation involving themes succinctly. This makes constructing 
a relational model much easier and less error-prone than constructing a Bayesian 
network. A relational model with a dozen or so general rules may correspond to 
a Bayesian network that involves hundreds of thousands of CPT parameters. 

Cumulative distribution networks [88] are a class of graphical models for directly
representing the joint cumulative distribution function of many random variables.
Inference in such models is performed by the derivative-sum-product
message-passing algorithm, in which messages correspond to derivatives
of the joint cumulative distribution function.


Data sets 

Well-known benchmarks of Bayesian network learning algorithms include the 
Asia [111], Insurance [16], and Alarm [14] networks. The Asia network, shown 
in Fig. 19.3, is a small network that studies the effect of several parameters 
on the incidence of having lung cancer. The network has 8 nodes and 8 edges. 
The Insurance network was originally used for evaluating car insurance risks; it 
contains 27 variables and 52 edges. The Alarm network is used in the medical 
domain for potential anesthesia diagnosis in the operating room; it has 37 nodes, 
of which many have multiple values, and 46 directed edges. 


19.3 Learning Bayesian networks

Learning a Bayesian network from data requires the construction of the structure
and CPTs from a given database of cases. It requires learning the structure, the
parameters for the structure (i.e. the conditional probabilities among variables),
hidden variables and missing values. Learning the structure is a much more
challenging problem than estimating the parameters.

Given a training set $D = \{x_1, \ldots, x_N\}$, the goal is to find a Bayesian network
$B$ that approximates the joint distribution $p(x)$. The network $B$ can be found
by maximizing the likelihood or the log-likelihood of the data:


$$L(B|D) = \sum_{n=1}^{N} \log p_B(x_n) = \sum_{n=1}^{N} \sum_{i=1}^{v} \log p_B(x_{n,i} | pa_{n,i}), \tag{19.10}$$

where $pa_{n,i}$ is the parent set of the $i$th variable for the data point $x_n$.
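The double sum in (19.10) can be sketched in Python for a discrete network. The data layout is an assumption made for the example: each data point is a dictionary from variable names to values, `parents` lists the parents of each variable, and `cpts[var][parent_values][value]` stores the conditional probability. The CPTs are those of Example 19.2, and the three data points are hypothetical.

```python
from math import log

def log_likelihood(data, parents, cpts):
    """Log-likelihood of the data under a discrete Bayesian network, cf. (19.10)."""
    total = 0.0
    for row in data:                       # sum over data points x_n
        for var, pa in parents.items():    # sum over variables i
            pa_values = tuple(row[p] for p in pa)
            total += log(cpts[var][pa_values][row[var]])
    return total

parents = {"X": (), "Y": ("X",)}
cpts = {
    "X": {(): {"T": 0.3, "F": 0.7}},
    "Y": {("T",): {"T": 0.0, "F": 1.0}, ("F",): {"T": 0.8, "F": 0.2}},
}
# Hypothetical data set; it avoids the zero-probability case X = T, Y = T.
data = [{"X": "T", "Y": "F"}, {"X": "F", "Y": "T"}, {"X": "F", "Y": "T"}]
ll = log_likelihood(data, parents, cpts)
```

A data point with zero probability under the network would make the log-likelihood undefined, which is one reason why estimated CPTs are often smoothed in practice.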

The physical joint probability distribution for $D$ is encoded in a network
structure $S$. The problem of learning probabilities in a Bayesian network can
now be stated: Given a random sample $D$, compute the posterior distribution
$p(\theta_S|D, S^h)$, where $\theta_S$ is the vector of parameters $(\theta_1, \ldots, \theta_v)$,
$\theta_i$ being the vector of parameters for node $i$, and $S^h$ denotes the event that
the physical joint probability distribution can be factored according to the
network structure $S$.

We have


$$p(x|\theta_S, S^h) = \prod_{i=1}^{v} p(x_i | pa_i, \theta_i, S^h), \tag{19.11}$$
where $p(x_i|pa_i, \theta_i, S^h)$ is a local distribution function. Thus, a Bayesian network
can be viewed as a collection of probabilistic models, organized by conditional- 
independence relationships. 


Learning the structure 


Structural learning of a Bayesian network can generally be divided into two classes
of methods: independence analysis-based methods and score-based methods.
The score-based approach [105] maps every structure of Bayesian network to a score
and searches the space of all structures for a good Bayesian network that fits
the data set best. Searching for the best network structure is NP-hard,
even from a small-sample data set and when each node has at most two parents
[35]. Identifying high-scoring DAGs from large data sets when using a consistent
scoring criterion is also NP-hard [37]. A heuristic or stochastic optimization
method, such as greedy search, iterated hill climbing or simulated annealing, is
usually used to search for the best network structure. Score-based algorithms are
more robust for small data sets, and they work with a wide range of probabilistic models.
Independence analysis-based algorithms do not require computation of the
parameters of the model during the structure discovery process, and thus are efficient.
They are generally more efficient than the score-based approach for sparse
networks. However, most of these algorithms need an exponential number of conditional
independence tests. Conditional independence tests with large condition
sets may be unreliable unless the volume of the data set is enormous [44]. The
conditional independence approach is equivalent to minimizing Kullback-Leibler
divergence using the score-based approach [45]. Hybrid methods take advantage
of both the approaches [188].

An incremental structural learning method gradually modifies a Bayesian net- 
work structure to fit a sequential stream of observations, where the underlying 
distribution can change during the sampling of the database [139]. 


Independence analysis-based approach 

Exploiting the fact that the network structure represents the conditional independence,
independence analysis-based methods find causal relations between
the random variables, and deduce the structure of the graph [174]. They conduct
a number of conditional independence tests on the data, successively constrain
the number of possible structures consistent with the results of those tests to
a singleton (if possible), and infer that structure as the only possible one.
The tests are usually done by using statistical [174] or information-theoretic [32]
measures. The independence-based approach has been exemplified by the SGS
[174], PC [174], GES [36], and grow-shrink [129] algorithms.

In an independence analysis-based method, search for separators of vertex 
pairs is a key issue for orientation of edges and for recovering DAG structures 
and causal relationships among variables. To recover structures of DAGs, the 
inductive causation algorithm [194] searches for a separator of two variables 
from all possible variable subsets such that the two variables are independent 
conditionally on the separator. A systematic way of searching for separators in 
increasing order of cardinality is implemented in the PC algorithm [174].

When ignoring the directions of a DAG, one gets the skeleton of a DAG. The
PC algorithm starts from a complete, undirected graph and recursively deletes
edges based on conditional independence decisions. It runs in the worst case in
exponential time with respect to the number of nodes, but if the true underlying
DAG is sparse, which is often a reasonable assumption, this reduces to
a polynomial runtime [96]. For very high-dimensional, sparse Gaussian distributions,
the PC algorithm is asymptotically consistent for the equivalence class of the DAG
and its skeleton [96]. It is computationally feasible for
such high-dimensional, sparse problems. The R-package pcalg can be used
to estimate from data the underlying skeleton or equivalence class of a DAG. 
For low-dimensional problems, there are a number of other implementations of 
the PC algorithm: Hugin, Murphy’s Bayes Network Toolbox and Tetrad IV. An 
extensive comparative study of different algorithms is given in [188]. 

In [203], a recursive method for structural learning of DAGs is proposed, in 
which the problem of structural learning for a DAG is recursively decomposed 
into two problems of structural learning for two vertex subsets until no subset 
can be decomposed further. Search for separators of a pair of variables in a 
large DAG is localized to small subsets, and thus the approach can improve the 
efficiency of searches and the power of statistical tests for structural learning. 
These locally learned subgraphs are finally gradually combined into the entire
DAG. Statistical tests are used to determine a skeleton as in the inductive causation
algorithm [194] and the PC algorithm [174].


Score-based approach 

The score of a structure is generally based on Occam’s razor principle. The score 
can be based on penalized log-likelihood such as AIC, BIC and MDL criteria, 
or Bayesian scoring methods such as K2 [44] and BDeu [78]. The Bayesian score 
is equivalent to the marginal likelihood of the model given the data. For most 
criteria, Bayesian network structures are interpreted as independence constraints 
in some distribution from which the data was generated. 

The K2 criterion applied to a DAG G evaluates the relative posterior proba- 
bility that the generative distribution has the same independence constraints as 
those entailed by G. The criterion is not score-equivalent because the prior dis- 
tributions over parameters corresponding to different structures are not always 
consistent. The (Bayesian) BDeu criterion [78] measures the relative posterior 
probability that the distribution from which the data were generated has the 
independence constraints given in the DAG. It uses a parameter prior that has 
uniform means, and requires both a prior equivalence sample size and a structure 
prior. 

Although the BIC score decomposes into a sum of local terms, one per node, 
local search is still expensive when there are hidden variables, because we need to run EM at each step. An
alternative iterative approach is to do the local search steps inside the M-step of 
EM: this is called structural EM, and provably converges to a local maximum 
of the BIC score [61]. Structural EM can adapt the structure in the presence 
of hidden variables, but usually performs poorly without prior knowledge about 
the cardinality and location of the hidden variables. 

A general approach for learning Bayesian networks with hidden variables [55] 
builds on the information bottleneck framework and its multivariate extension. 
The approach is able to avoid some of the local maxima in which EM can get 
trapped when learning with hidden variables. The algorithmic framework allows 
learning of the parameters as well as the structure of a network. 

The Bayesian BDeu criterion is very sensitive to the choice of prior hyper- 
parameters [170]. AIC and BIC are derived through asymptotics and their behav- 
ior is suboptimal for small sample sizes. It is hard to set the parameters for the 
structures selected with AIC or BIC. Factorized normalized ML [170] is an effec- 
tive scoring criterion, with no tunable parameters. The combination of the factor- 
ized normalized ML criterion and a sequential normalized ML-based parameter 
learning method yields a complete non-Bayesian method for learning Bayesian 
networks [171]. The approach is based on the minimax optimal normalized ML 
distribution, motivated by the MDL principle. Computationally, the method is 
parameter-free, robust, and as efficient as its Bayesian counterparts. 

Algorithm B [23] is a greedy construction heuristic. It starts with an empty
DAG and adds, at each step, the arc with the maximum increase in the (decomposable)
scoring metric such as BIC, while avoiding the inclusion of directed cycles
in the graph. The algorithm ends when adding any further valid arc does not
increase the value of the metric.
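The greedy construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [23]: the scoring metric is passed in as a hypothetical function `local_score(node, parent_set)` giving the local term of a decomposable score, and the toy score used in the usage example is a stand-in that simply rewards the arcs of the chain $Z \to Y \to X$ instead of computing BIC from data.

```python
def is_ancestor(parents, a, b):
    """Return True if a equals b or a is an ancestor of b in the current DAG."""
    stack, seen = [b], set()
    while stack:
        n = stack.pop()
        if n == a:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return False

def algorithm_b(nodes, local_score):
    """Greedily add the arc with the largest score gain until no arc improves the score."""
    parents = {v: set() for v in nodes}          # start from the empty DAG
    while True:
        best_gain, best_arc = 0.0, None
        for child in nodes:
            base = local_score(child, frozenset(parents[child]))
            for cand in nodes:
                # Skip existing arcs and arcs cand -> child that would close a cycle.
                if cand in parents[child] or is_ancestor(parents, child, cand):
                    continue
                gain = local_score(child, frozenset(parents[child] | {cand})) - base
                if gain > best_gain:
                    best_gain, best_arc = gain, (cand, child)
        if best_arc is None:                     # no valid arc increases the score
            return parents
        parents[best_arc[1]].add(best_arc[0])

# Toy stand-in for a decomposable score (an assumption, not BIC).
TRUE_PARENTS = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}

def toy_score(node, parent_set):
    hits = len(parent_set & TRUE_PARENTS[node])
    misses = len(parent_set - TRUE_PARENTS[node])
    return 2.0 * hits - 1.0 * misses

learned = algorithm_b(["Z", "Y", "X"], toy_score)
```

Because the score is decomposable, adding an arc changes only the local term of the child node, so each candidate arc is evaluated with two local score calls.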

An exact score-based structure discovery algorithm is given for Bayesian networks
of a moderate size (say, 25 variables or less) by using dynamic programming
in [102]. A parallel implementation of the score-based optimal structure
search using dynamic programming has $O(n 2^n)$ time and space complexity [179].
It is possible to learn the best Bayesian network structure with over 30 variables 
(http://b-course.hiit.fi/bene) [169]. 

A distributed algorithm for computing the MDL in learning Bayesian networks 
from data is presented in [106]. The algorithm exploits both properties of the 
MDL-based score metric and a distributed, asynchronous, adaptive search tech- 
nique called nagging. Nagging is intrinsically fault-tolerant, has dynamic load 
balancing features, and scales well. The distributed algorithm can provide opti- 
mal solutions for larger problems as well as good solutions for Bayesian networks 
of up to 150 variables. 

The problem of learning Bayesian network structures from data based on score 
functions that are decomposable is addressed in [24]. It describes properties that 
strongly reduce the time and memory costs of many known methods without 
losing global optimality guarantees. These properties are derived for different 
score criteria such as MDL (or BIC), AIC and Bayesian Dirichlet criterion. A 
branch-and-bound algorithm integrates structural constraints with data in a way 
to guarantee global optimality. In [31], a conditional independence test-based 
approach is used to find node-ordering information, which is then fed to the K2 
algorithm for structure learning. The performance of the algorithm is mainly 
dependent on the stage that identifies the order of the nodes, which avoids expo- 
nential complexity. 

An asymptotic approximation of the marginal likelihood of data is presented 
in [160], given a naive Bayesian model with binary variables. It proves that the 
BIC score that penalizes the log-likelihood of a model by $\frac{d}{2}\ln N$ is incorrect for 
Bayesian networks with hidden variables and suggests an adjusted BIC score. 
Moreover, no uniform penalty term exists for such models in the sense that the 
penalty term depends on the averaged sufficient statistics. This claim stands 
in contrast to linear and curved exponential families, where the BIC score has 
been proven to provide a correct asymptotic approximation for the marginal 
likelihood. 

The sparse Bayesian network structure learning algorithm proposed in [89] 
employs a formulation involving one $L_1$-norm penalty term to impose sparsity 
and another penalty term to ensure that the learned Bayesian network is a 
DAG. The sparse Bayesian network has a computational complexity that is linear 
in the sample size and quadratic in the number of variables. This makes the 
sparse Bayesian network more scalable and efficient than most of the existing 
algorithms. 
The K2 Algorithm 

The basic idea of the Bayesian approach is to maximize the probability of the 
network structure given the data, that is, to maximize $P(B_S|D)$ over all possible 
network structures $B_S$ given the cases of the data set $D$. 

The K2 algorithm [44] is a well-known greedy search algorithm for learning the 
structure of Bayesian networks from data. It uses a Bayesian scoring metric 
known as the K2 metric, which measures the joint probability of a Bayesian network 
G and a data set D. The K2 metric assumes that a complete data set D of sample 
cases over a set of attributes that accurately models the network is given. The 
K2 algorithm assumes that a prior ordering on the nodes is available and that all 
structures are equally likely. It searches, for every node, the set of parent nodes 
that maximizes the K2 metric. If there is a predefined ordering of variables, the 
K2 algorithm can efficiently determine the structure. 

A probabilistic network $B$ over the set of variables $U$ is a pair $B = (B_S, B_P)$, 
where the network structure $B_S$ is a DAG with a node for every variable in 
$U$ and $B_P$ is a set of CPTs associated with $B_S$. For every variable $X_i \in U$, 
the set $B_P$ contains a CPT $P(X_i|pa_i)$ that enumerates the probabilities of all 
values of $X_i$ given all combinations of values of the variables in its parent set 
$pa_i$ in the network structure $B_S$. The network $B$ represents the joint probability 
distribution $P(U)$ defined by $P(U) = \prod_{i=1}^{n} P(X_i|pa_i)$. 

The K2 algorithm is a greedy heuristic algorithm for selecting a network struc- 
ture that considers at most $O(n^3)$ different structures for $n$ nodes. All nodes are 
considered independent of one another. For each node, a parent set is calculated 
by starting with the empty parent set and successively adding to the parent set 
the node that maximally improves $P(B_S, D)$ until no more nodes can be added 
such that $P(B_S, D)$ increases, for data set $D$ and all possible network structures 
$B_S$. A major drawback of the K2 algorithm is that the ordering that is chosen 
on the nodes influences the resulting network structure and the quality of this 
structure. The K2 algorithm is given in Algorithm 19.1. 

K3 [17] is a modification of K2 where the Bayesian measure is replaced by 
the MDL measure and a uniform prior distribution over network structures is 
assumed. The MDL measure is approximately equal to the logarithm of the 
Bayesian measure. The results are comparable for K2 and K3, but K3 tends 
to be slightly faster, outputting network structures with fewer arcs than K2. A 
major drawback of both K2 and K3 is that their performance is highly dependent 
on the ordering on the variables taken as point of departure. The two measures 
have the same properties for infinitely large databases. For smaller databases, the 
MDL measure assigns equal quality to networks that represent the same set of 
independencies while the Bayesian measure does not. 
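Algorithm 19.1 (Algorithm K2). 

1. Let the variables of $U$ be ordered $X_1, \ldots, X_n$. 
2. for $i = 1, \ldots, n$ 
   $pa_i^{new} \leftarrow pa_i^{old} \leftarrow \emptyset$. 
3. for $i = 2, \ldots, n$ 
   repeat steps a to d until ($pa_i^{new} = pa_i^{old}$ or $|pa_i^{new}| = i - 1$): 
   a. $pa_i^{old} \leftarrow pa_i^{new}$. 
   b. Let $B_S$ be defined by $pa_1^{old}, \ldots, pa_n^{old}$. 
   c. $Z \leftarrow \arg\max_Y \left\{ \frac{P(B_{S_Y}, D)}{P(B_S, D)} \,\middle|\, Y \in \{X_1, \ldots, X_{i-1}\} \setminus pa_i^{old} \right\}$, 
      where $B_{S_Y}$ is $B_S$ but with $pa_i = pa_i^{old} \cup \{Y\}$. 
   d. if $P(B_{S_Z}, D) > P(B_S, D)$ then $pa_i^{new} \leftarrow pa_i^{old} \cup \{Z\}$. 
4. Output $B_S$ defined by $pa_1^{new}, \ldots, pa_n^{new}$. 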
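
The greedy parent search of Algorithm K2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score used is the standard Cooper-Herskovits form of the K2 metric with a uniform structure prior (the formula itself is not given in this section), variables are integer-coded columns of a complete data set, and the names `log_k2_family` and `k2` are hypothetical. Because the metric decomposes over nodes, comparing the family term of node i is equivalent to comparing the ratio of the full joint scores.

```python
from math import lgamma

def log_k2_family(data, child, parents, arity):
    """Log of the Cooper-Herskovits term for one node and a candidate parent set."""
    r = arity[child]
    counts = {}  # parent configuration -> counts of each child value
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0] * r)[row[child]] += 1
    score = 0.0
    for n_ijk in counts.values():
        n_ij = sum(n_ijk)
        # log[(r - 1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += lgamma(r) - lgamma(n_ij + r)
        score += sum(lgamma(n + 1) for n in n_ijk)
    return score

def k2(data, order, arity):
    """Greedy K2 search: parents of each node are chosen among its predecessors."""
    parents = {x: [] for x in order}
    for i, child in enumerate(order):
        best = log_k2_family(data, child, parents[child], arity)
        while len(parents[child]) < i:
            candidates = [y for y in order[:i] if y not in parents[child]]
            new_best, z = max(
                (log_k2_family(data, child, parents[child] + [y], arity), y)
                for y in candidates
            )
            if new_best <= best:      # no candidate improves the score: stop
                break
            parents[child].append(z)
            best = new_best
    return parents

# Toy data (an assumption for illustration): X1 copies X0, X2 is independent.
data = [(a, a, c) for a in (0, 1) for c in (0, 1)] * 10
structure = k2(data, order=[0, 1, 2], arity=[2, 2, 2])
```

On this toy data the search attaches X0 as the parent of X1 and leaves X2 without parents. A different node ordering can give a different structure, which is the drawback noted in the text.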





Learning the parameters 


The goal of parametric learning is to find the parameters of each conditional 
probability distribution that maximize the likelihood of the training data. The ML 
method leads, through the classical decomposition of the joint probability into a 
product, to estimating each term of the product separately from the data. It 
asymptotically converges toward the true probability if the proposed structure is exact. 
The Bayesian method tries to calculate the most probable parameters given the 
data, which is equivalent to weighting the parameters with a prior distribution. The 
most widely used prior is the Dirichlet distribution. The VC dimension of the set of 
instanced Bayesian networks is upper-bounded by the number of parameters of 
the set [66]. 
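
As a minimal sketch of the two estimation styles for complete discrete data, the following Python function estimates one CPT from counts. With `alpha = 0` it returns the ML estimate (relative frequencies); with `alpha > 0` it returns the posterior mean under a symmetric Dirichlet prior, which is one common Bayesian choice and is assumed here rather than taken from the text. The function name and data layout are hypothetical.

```python
from collections import Counter

def estimate_cpt(data, child, parents, arity, alpha=0.0):
    """Return {parent_config: [P(child = k | parent_config) for each k]}.

    alpha = 0 gives the ML estimate; alpha > 0 adds a Dirichlet pseudo-count
    to every cell (posterior-mean estimate under a symmetric prior).
    Only parent configurations that occur in the data are returned.
    """
    r = arity[child]
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    cpt = {}
    for cfg in {cfg for cfg, _ in joint}:
        counts = [joint[(cfg, k)] + alpha for k in range(r)]
        total = sum(counts)
        cpt[cfg] = [c / total for c in counts]
    return cpt

# Toy data set (hypothetical): column 0 is the parent, column 1 the child.
rows = [(0, 0), (0, 0), (0, 1), (1, 1)]
ml = estimate_cpt(rows, child=1, parents=[0], arity=[2, 2])
smoothed = estimate_cpt(rows, child=1, parents=[0], arity=[2, 2], alpha=1.0)
```

The smoothed estimate avoids the zero probability that ML assigns to child value 0 under parent value 1, which was never observed in this small sample.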

When learning the parameters of a Bayesian network with missing values or 
hidden variables, the common approach is to use the EM algorithm [52]. EM 
performs a greedy search of the likelihood surface and converges to a local sta- 
tionary point. EM is faster and simpler than the gradient-descent method since 
it uses the natural gradient. 

The concavity of the log-likelihood surface for logistic regression is a well- 
known result. The condition, under which Bayesian network models correspond 
to logistic regression with completely freely varying parameters, is given in [157]. 
Only then can we guarantee that there are no local maxima in the likelihood 
surface. 

An inductive transfer learning method for Bayesian networks [125] induces 
a model for a target task from data of this task and of other related auxiliary 
tasks. The method includes both structure and parameter learning. The structure 
learning method is based on the PC algorithm, and it combines the dependency 
measures obtained from data in the target task, with those obtained from data 
in the auxiliary tasks. The parameter learning algorithm uses an aggregation 
process, combining the parameters estimated from the target task, with those 
estimated from the auxiliary data. A significant improvement is observed in 
terms of structure and parameters when knowledge is transferred from similar 
tasks [125]. 


Online learning of the parameters 
Let $Z_i$ be a node in the network that takes any value from the set $\{z_i^1, \ldots, z_i^{r_i}\}$. 
Let $Pa_i$ be the set of parents of $Z_i$ in the network that takes one of the configura- 
tions denoted by $\{pa_i^1, \ldots, pa_i^{q_i}\}$. An entry in the CPT of the variable $Z_i$ is given 
by $\theta_{ijk} = P(Z_i = z_i^k | Pa_i = pa_i^j)$. We are given a data set $D = \{y_1, \ldots, y_T\}$, 
and we have a current set of parameters $\bar{\theta}$ that defines the network. The data 
set is either complete or incomplete. 

The network parameters are updated by 

$$\tilde{\theta} = \arg\max_{\theta} \left[ \eta L_D(\theta) - d(\theta, \bar{\theta}) \right], \tag{19.12}$$

where $L_D(\theta)$ is the normalized log-likelihood of the data given the network, 
$d(\theta, \bar{\theta})$ is the $\chi^2$-distance between the two models and $\eta$ is the learning rate. 
Online learning of Bayesian network parameters has been discussed in [173], 
[11]. EM($\eta$) [11] is derived by solving the maximization subject to the con- 
straint that $\sum_k \theta_{ijk} = 1$, $\forall i, j$. Voting EM [40] is obtained by adapting EM($\eta$) to 
online learning. Voting EM takes a frequentist approach while the Spiegelhalter- 
Lauritzen algorithm [173] uses a Bayesian approach to parameter estimation. 
Voting EM is given by 

$$\theta_{ijk}(t) = \theta_{ijk}(t-1) + \eta \left[ \frac{P(z_i^k, pa_i^j | y_t, \theta(t-1))}{P(pa_i^j | y_t, \theta(t-1))} - \theta_{ijk}(t-1) \right] \quad \text{if } P(pa_i^j | y_t, \theta(t-1)) \neq 0; \tag{19.13}$$

$$\theta_{ijk}(t) = \theta_{ijk}(t-1) \quad \text{otherwise}. \tag{19.14}$$

The learning rate $\eta \in (0,1)$ controls the influence of the past; it can be selected 
by the Robbins-Monro conditions. 
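
For a fully observed sample the Voting EM update takes a particularly simple form, which the following Python sketch illustrates. The assumption here is complete data: the ratio of posteriors in (19.13) then equals 1 for the observed child value and 0 for the others, and only the CPT row of the observed parent configuration has a nonzero parent posterior. The function name and the dictionary layout of the CPT are hypothetical.

```python
def voting_em_step(theta, parent_cfg, child_val, eta):
    """One Voting EM update of a single CPT for a fully observed sample.

    theta maps each parent configuration j to the list [theta_ij0, theta_ij1, ...].
    Rows of unobserved parent configurations are left unchanged, as in (19.14).
    """
    row = theta[parent_cfg]
    theta[parent_cfg] = [
        p + eta * ((1.0 if k == child_val else 0.0) - p)   # update (19.13)
        for k, p in enumerate(row)
    ]
    return theta

# Hypothetical CPT of a binary node with one binary parent.
theta = {(0,): [0.5, 0.5], (1,): [0.5, 0.5]}
theta = voting_em_step(theta, parent_cfg=(0,), child_val=1, eta=0.1)
```

Each updated row remains normalized, since the increments sum to $\eta(1 - \sum_k \theta_{ijk}) = 0$. With incomplete data the indicator would be replaced by the posterior ratio obtained from an inference routine.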

Mixtures of truncated exponentials is a model for dealing with discrete and 
continuous variables simultaneously in Bayesian networks without imposing any 
restriction on the network topology and avoiding rough approximations of meth- 
ods based on the discretization of the continuous variables [135]. A method for 
inducing the structure of such a network from data is proposed in [156]. 


Constraint-handling 


Constraints can be embedded within belief networks by modeling each constraint 
as a CPT. One approach is to add a new variable for each constraint that is 
perceived as its effect (child node) in the corresponding causal relationship and 
then to clamp its value to true [43], [149]. 

The use of several types of structural restrictions within algorithms for learn- 
ing Bayesian networks is considered in [51]. These restrictions may codify expert 
knowledge in a given domain, in such a way that a Bayesian network represent- 
ing this domain should satisfy them. Three types of restrictions are formally 
defined: existence of arc and/or edge, absence of arc and/or edge, and ordering 
restrictions. Two learning algorithms are investigated: a score-based local search 
algorithm, with the operators of arc addition, arc removal and arc reversal, and 
the PC algorithm. 

A framework that combines deterministic and probabilistic networks [130] 
allows two distinct representations: causal relationships that are directional and 
normally quantified by CPTs, and symmetrical deterministic constraints. In par- 
ticular, a belief network with a set of instantiated variables (e.g., evidence) can be 
viewed as a mixed network by regarding the evidence set as a set of constraints. 

Models that use different types of parameter sharing include dynamic Bayesian 
networks, HMMs, and Kalman filters. Parameter sharing methods constrain 
parameters to share the same value, but do not capture more complicated con- 
straints among parameters such as inequality constraints or constraints on sums 
of parameter values. 


Bayesian network inference 


After constructing a Bayesian network, one can obtain various probabilities from 
the model. The computation of a probability from a model is known as proba- 
bilistic inference. Inference using a Bayesian network, also called belief updating, 
is based on Bayes’ theorem as well as the CPTs. Given the joint probability 
distribution, all possible inference queries can be answered by marginalization 
(summing out over irrelevant variables). Bayesian network inference is to com- 
pute the inference probability $P(X = x|E = e)$, i.e., the probabilities of query 
nodes ($X = x$) given the values of evidence nodes ($E = e$). 
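
Inference by marginalization can be illustrated with a brute-force enumeration over a tiny network. The sketch below assumes a hypothetical three-node network S -> W <- R with made-up CPT values; it is exponential in the number of variables and is meant only to make the definition of $P(X = x|E = e)$ concrete, not to stand in for the exact algorithms discussed next.

```python
from itertools import product

# Hypothetical CPTs (all variables binary); the numbers are illustrative only.
P_S = {1: 0.3, 0: 0.7}
P_R = {1: 0.2, 0: 0.8}
P_W1 = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99}  # P(W = 1 | S, R)

def joint(s, r, w):
    """Joint probability as the product of the CPT entries."""
    pw1 = P_W1[(s, r)]
    return P_S[s] * P_R[r] * (pw1 if w == 1 else 1.0 - pw1)

def query(target, evidence):
    """P(target = 1 | evidence), summing the joint over unobserved variables."""
    names = ("S", "R", "W")
    num = den = 0.0
    for values in product((0, 1), repeat=3):
        world = dict(zip(names, values))
        if any(world[name] != val for name, val in evidence.items()):
            continue                      # inconsistent with the evidence
        p = joint(*values)
        den += p
        if world[target] == 1:
            num += p
    return num / den

p_rain_given_wet = query("R", {"W": 1})
p_rain_given_wet_and_sprinkler = query("R", {"W": 1, "S": 1})
```

Observing the second cause S lowers the belief in R, the familiar explaining-away effect of a common child.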

Exact methods exploit the independence structure contained in the network 
to efficiently propagate uncertainty [149]. A message-passing scheme updates 
the probability distributions for each node in a Bayesian network in response to 
observations of one or more variables. The commonly used exact algorithm for 
discrete variables is the evidence propagation algorithm in [111], improved later 
in [93], which first transforms the Bayesian network into a tree where each node 
in the tree corresponds to a subset of variables in X and then exploits properties 
of this tree to perform probabilistic inference. This algorithm needs an ordering 
of the variables in order to make the triangulation of the moral graph associated 
with the original Bayesian network structure. A graph is triangulated if it has 
no cycles with a length greater than three without a chord. Obtaining the best 
triangulation for a Bayesian network is an NP-hard problem. 
Although we use conditional independence to simplify probabilistic inference, 
exact inference in an arbitrary Bayesian network for discrete variables is NP-hard 
[43]. Even approximate inference is NP-hard [46], but approximate methods per- 
form this operation in an acceptable amount of time. Approximating the infer- 
ence probability in any sense, even for a single evidence node, is NP-hard [47]. 
Some popular approximate inference methods are sampling methods, variational 
Bayesian methods, and loopy belief propagation [210]. A randomized approxi- 
mation algorithm, called the bounded-variance algorithm [47], is a variant of the 
likelihood-weighting algorithm. 

The explaining Bayesian network inferences [208] procedure explains how vari- 
ables interact to reach conclusions. The approach explains the value of a target 
node in terms of the influential nodes in the target’s Markov blanket under 
specific contexts. Working back from the target node, the approach shows the 
derivation of each intermediate variable, and finally explains how missing and 
erroneous evidence values are compensated. 


Belief propagation 


A complete graph is a graph with every pair of vertices joined by an edge. A 
clique is a complete subgraph—a set of vertices that are all adjacent to one 
another. In other words, the set of nodes in a clique is fully connected. A clique 
is maximal if no other vertices can be added to it so as to still yield a clique. 

Belief propagation is a popular method of performing approximate inference 
on arbitrary graphical models. The belief propagation algorithm, also called 
the sum-product algorithm, is a popular means of solving inference problems 
exactly on a tree, but approximately in graphs with cycles [149]. It calculates 
the marginal pdfs for random variables by passing messages in a graphical model. 
Belief propagation is exact for tree-structured graphical models (with 
no loops). It is also widely applied to graphical models with cycles to obtain 
approximate solutions. Some additional justifications for loopy belief propagation have 
been developed in [182]. 

For tree-structured graphical models, belief propagation can be used to effi- 
ciently perform exact marginalization. However, as observed in [149], one may 
also apply belief propagation to arbitrary graphical models by following the same 
local message-passing rules at each node and ignoring the presence of cycles in 
the graph; this procedure is typically referred to as loopy belief propagation. 

The goal of belief propagation is to compute the marginal distribution $p(X_t)$ 
at each node $t$. Belief propagation takes the form of a message-passing algorithm 
between nodes, expressed in terms of an update to the outgoing message at 
iteration $i$ from each node $t$ to each neighbor $s$ in terms of the previous itera- 
tion's incoming messages from node $t$'s neighbors $\Gamma_t$. Typically each message is 
normalized so as to integrate (sum) to unity. 

The sum-product algorithm [103] is a generic message-passing algorithm that 
operates in a factor graph. It computes, either exactly or approximately, various 
marginal functions derived from the global function. Factor graphs are a 
straightforward generalization of the Tanner graphs obtained by applying them 
to functions. Bounds on the accumulation of errors in the system of approxi- 
mate belief propagation message-passing are given in [92]. This analysis leads to 
convergence conditions for traditional belief propagation message-passing. 

A wide variety of algorithms developed in AI, signal processing, and digital 
communications can be derived as specific instances of the sum-product algo- 
rithm, including the forward-backward algorithm, the Viterbi algorithm, the 
iterative turbo or LDPC decoding algorithm [53], Pearl’s belief propagation algo- 
rithm [149] for Bayesian networks, the Kalman filter, and certain FFT algorithms 
[103]. The forward-backward algorithm, sometimes referred to as the BCJR or 
MAP algorithm in coding theory, is an application of the sum-product algorithm 
to HMM or to the trellises in which certain variables are observed at the output 
of a memoryless channel. The basic operations in the forward-backward recur- 
sions are therefore sums of products. When all codewords are a priori equally 
likely, MAP amounts to ML sequence detection. The Viterbi algorithm operates 
in the forward direction only; however, since memory of the best path is main- 
tained and some sort of traceback is performed in making a decision, it might 
be viewed as being bidirectional. 

Variational message-passing [199] applies variational inference to Bayesian 
networks. Like belief propagation, the method proceeds by sending messages 
between nodes in the network and updating posterior beliefs using local oper- 
ations at each node. In contrast to belief propagation, it can be applied to a 
very general class of conjugate-exponential models. By introducing additional 
variational parameters, it can be applied to models containing non-conjugate 
distributions. The method is guaranteed to converge to a local minimum of the 
Kullback-Leibler divergence. It has been implemented in a general-purpose infer- 
ence engine VIBES (http://vibes.sourceforge.net) which allows models to 
be specified graphically or in a text file containing XML. XMLBIF (XML for 
Bayesian networks interchange format) is an XML-based file format for repre- 
senting Bayesian networks. 


Loopy belief propagation 
In loopy belief propagation [210], Pearl’s belief propagation algorithm is applied 
to the original graph, even if it has loops (undirected cycles). Loopy belief propa- 
gation can be used to compute approximate marginals in Bayesian networks and 
Markov random fields. However, when applied to graphs with cycles, it is not 
guaranteed to converge, however many iterations are run. In practice, the procedure often 
arrives at a reasonable set of approximations to the correct marginal distribu- 
tions. Double-loop algorithms guarantee convergence [212], but are an order of 
magnitude slower than standard loopy belief propagation. 

The effect of n iterations of loopy belief propagation at any particular node s 
is equivalent to exact inference on a tree-structured unrolling of the graph from 
s. The computation tree with depth n consists of all length-n paths emanating 
from s in the original graph which do not immediately backtrack (though they 
may eventually repeat nodes). 

The convergence and fixed points of loopy belief propagation may be consid- 
ered in terms of a Gibbs measure on the graph’s computation tree [182]. Loopy 
belief propagation is guaranteed to converge if the graph satisfies certain con- 
ditions [182]. Fixed points of loopy belief propagation correspond to extrema of 
the so-called Bethe free energy [210]. Sufficient conditions for the uniqueness of 
loopy belief propagation fixed points are derived in [80]. 

Lazy propagation [110] is a scheme for modeling and exact belief update in 
the restricted class of mixture Bayesian networks known as conditional linear 
Gaussian Bayesian networks that is more general than Pearl’s scheme. The basic 
idea of lazy propagation is to instantiate potentials to reflect evidence and to 
postpone the combination of potentials until it becomes mandatory by a variable 
elimination operation. Lazy propagation yields a reduction in potential domain 
sizes and a possibility of avoiding some of the postponed potential combinations. 
In traditional message-passing schemes, a message consists of a single potential 
over the variables shared by the sender and receiver cliques. In lazy propagation, 
a message consists of a set of potentials. 


Factor graphs and the belief propagation algorithm 


A Bayesian network can easily be represented as a factor graph by introducing a 
factor for each variable, namely, the CPT of the variable given its parents in the 
Bayesian network. A Markov random field can be represented as a factor graph 
by taking the clique potentials as factors. Factor graphs naturally express the 
factorization structure of probability distributions; thus, they form a convenient 
representation for approximate inference algorithms that exploit this factoriza- 
tion. The exact solution to MAP inference in graphical models is well-known 
to be exponential in the size of the maximal cliques of the triangulated model, 
while approximate inference is typically exponential in the size of the model’s 
factors. 

A factor graph is an undirected graph consisting of nodes and edges. The 
factor nodes {F} and the variable nodes {X} are the two types of nodes in a 
factor graph. A factor node F represents a real-valued function and a variable 
node X represents a random variable. An edge E connecting a factor node F 
and a variable node X is an unordered pair E = (F, X) = (X, F). There is one 
edge E = (F, X) connecting a factor node F and a variable node X if and only 
if F is a function of X. There are no edges between any two factor nodes or any 
two variable nodes. The factor node F represents a pdf and the product of all 
factors equals the joint pdf of the probability model. 
Figure 19.5 A Bayesian network of four variables. (a) Bayesian network. (b) Factor graph. 


Example 19.4: Consider an error-correcting code, expressed by a system of linear 
equations 

$$\begin{aligned} X_1 + X_2 + X_3 &= 0 \\ X_2 + X_4 + X_5 &= 0 \\ X_1 + X_4 + X_6 &= 0 \end{aligned}$$

where the variables take value of 0 or 1, and addition is modulo-2. By denoting 
the operation + as boxes and the variables as circles, the factor graph is as shown 
in Fig. 19.4. 
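
The bipartite structure of such a factor graph can be written down directly from the parity checks. The Python sketch below assumes that the three checks of the example involve the variable triples (1, 2, 3), (2, 4, 5) and (1, 4, 6); the function names are hypothetical. Each check is a factor node, each bit a variable node, and an edge exists exactly when the bit appears in the check.

```python
from itertools import product

# Each parity check (factor node) lists the indices of the variables it involves.
checks = [(1, 2, 3), (2, 4, 5), (1, 4, 6)]

def factor_graph(checks):
    """Return the two adjacency maps of the bipartite factor graph."""
    factor_to_vars = {f: list(vs) for f, vs in enumerate(checks)}
    var_to_factors = {}
    for f, vs in factor_to_vars.items():
        for v in vs:
            var_to_factors.setdefault(v, []).append(f)
    return factor_to_vars, var_to_factors

def is_codeword(x, checks):
    """x maps variable index -> bit; every check must sum to 0 modulo 2."""
    return all(sum(x[v] for v in vs) % 2 == 0 for vs in checks)

factor_to_vars, var_to_factors = factor_graph(checks)
codewords = [
    bits for bits in product((0, 1), repeat=6)
    if is_codeword(dict(zip(range(1, 7), bits)), checks)
]
```

With three independent checks on six bits, the enumeration yields $2^{6-3} = 8$ codewords.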


Example 19.5: Consider a Bayesian network with four variables X, Y, Z, W, as 
shown in Fig. 19.5a. The graph represents probability factorization: 


P(X, Y, Z, W) = p(Z|X, Y )p(Y |X, W)p(X|W)p(W). 


The corresponding factor graph is shown in Fig. 19.5b. 
Any probability model can be represented by a factor graph in three steps: 

• Factorize the joint pdf of the probability model as a product of pdfs and 
  conditional pdfs; 
• Associate each random variable with a variable node and each pdf with a 
  factor node; 
• Link one variable node X to one factor node F if the pdf F is a function of 
  the random variable X. 


The belief propagation algorithm operates by passing messages along edges 
in a factor graph. Messages on an edge $E = (F, X)$ are functions of the variable 
node $X$. Let $m_{X \to F}$ be the message from $X$ to $F$ and $m_{F \to X}$ be the message 
from $F$ to $X$. The message $m_{X \to F}(X)$ (or $m_{F \to X}(X)$) can be interpreted as an 
approximation of the pdf on all the information that the variable node $X$ (or the 
factor node $F$) currently has except for that coming from the factor node $F$ (or 
the variable node $X$). Let $n(\cdot)$ be the set of neighboring nodes of a node. The 
message update rules are summarized below: 

• The message $m_{X \to F}$ is updated by 

$$m_{X \to F}(X) = c_1 \prod_{G \in n(X) \setminus \{F\}} m_{G \to X}(X), \tag{19.15}$$

  where the constant $c_1$ makes $m_{X \to F}(X)$ a pdf. 
• The message $m_{F \to X}$ is updated by 

$$m_{F \to X}(X) = c_2 \int_{\sim\{X\}} F(\cdot) \prod_{Y \in n(F) \setminus \{X\}} m_{Y \to F}(Y), \tag{19.16}$$

  where the constant $c_2$ makes $m_{F \to X}(X)$ a pdf, $F(\cdot)$ is the pdf associated with 
  the factor node $F$, and $\int_{\sim\{X\}}$ is the integration over all arguments of the 
  integrand except $X$. 

The order of message propagation is not unique if messages are propagated 
through all edges in both directions in one iteration. The marginal pdf of $X$ can 
be estimated by 

$$\hat{F}(X) = c_3 \prod_{G \in n(X)} m_{G \to X}(X), \tag{19.17}$$

where the constant $c_3$ makes $\hat{F}(X)$ a pdf. 
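
The update rules (19.15) to (19.17) can be sketched for discrete variables, where the integral becomes a sum. The Python code below is a minimal illustration, not an optimized implementation: it assumes a synchronous (flooding) schedule, uniform initial messages, and factors given as plain Python functions; the name `sum_product` and the chain example with its numbers are hypothetical.

```python
from itertools import product
from math import prod

def sum_product(variables, factors, iterations=10):
    """variables: {name: number of states}; factors: list of (scope, function).
    Returns normalized marginal estimates {name: [p_0, p_1, ...]}."""
    edges = [(f, v) for f, (scope, _) in enumerate(factors) for v in scope]

    def normalize(msg):
        total = sum(msg)
        return [m / total for m in msg]

    to_factor = {(v, f): normalize([1.0] * variables[v]) for f, v in edges}
    to_var = {(f, v): normalize([1.0] * variables[v]) for f, v in edges}
    for _ in range(iterations):
        # Variable-to-factor messages, rule (19.15).
        new_to_factor = {}
        for f, v in edges:
            others = [g for g, u in edges if u == v and g != f]
            new_to_factor[(v, f)] = normalize(
                [prod(to_var[(g, v)][x] for g in others)
                 for x in range(variables[v])])
        # Factor-to-variable messages, rule (19.16), with sums for integrals.
        new_to_var = {}
        for f, v in edges:
            scope, func = factors[f]
            msg = [0.0] * variables[v]
            for assignment in product(*(range(variables[u]) for u in scope)):
                weight = func(*assignment)
                for u, x in zip(scope, assignment):
                    if u != v:
                        weight *= to_factor[(u, f)][x]
                msg[assignment[scope.index(v)]] += weight
            new_to_var[(f, v)] = normalize(msg)
        to_factor, to_var = new_to_factor, new_to_var
    # Marginal estimates, rule (19.17).
    return {v: normalize([prod(to_var[(f, v)][x] for f, u in edges if u == v)
                          for x in range(variables[v])])
            for v in variables}

# Chain A -> B -> C with illustrative CPTs; the factor graph is a tree.
p_a = [0.6, 0.4]
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]
p_c_given_b = [[0.9, 0.1], [0.5, 0.5]]
factors = [(("A",), lambda a: p_a[a]),
           (("A", "B"), lambda a, b: p_b_given_a[a][b]),
           (("B", "C"), lambda b, c: p_c_given_b[b][c])]
marginals = sum_product({"A": 2, "B": 2, "C": 2}, factors)
```

Because this factor graph has no loops, the estimates coincide with the exact marginals once the messages have traversed the tree; on a graph with cycles the same code would carry out loopy belief propagation.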

When the factor graph contains no loop, the belief propagation algorithm pro- 
duces exact marginal pdfs. However, when there are loops in the factor graph, 
belief propagation only performs approximate inference. Although it is difficult 
to prove its convergence, the loopy belief propagation algorithm is widely used in 
practice because of its low computational complexity and because it gives a good 
approximation to the marginal pdfs in many cases. Applications of loopy belief propagation 
include the iterative decoding of Turbo codes and LDPC codes [53]. 
The belief propagation-based sparse Bayesian learning algorithm discussed 
in [180] recovers sparse transform coefficients in large-scale compressed sensing 
problems. It is based on a hierarchical Bayesian model, which is turned into a 
factor graph. The algorithm has a computational complexity that is proportional 
to the number of transform coefficients. 

Stochastic belief propagation [141] is an adaptively randomized version of 
belief propagation, in which each node passes randomly chosen information to 
each of its neighbors. This significantly reduces the computational complexity 
from $O(m^2)$ to $O(m)$ per iteration, and the communication complexity from 
$m - 1$ real numbers for standard belief propagation message updates to only 
$\log_2 m$ bits per edge and iteration, where $m$ is the state dimension. 


Sampling (Monte Carlo) methods 


Markov chain Monte Carlo (MCMC) is an efficient approach in high dimensions. 
It is a general class of algorithms used for optimization, search and learning. The 
key idea is to build an ergodic Markov chain whose equilibrium distribution is 
the desired posterior distribution. MCMC is a sampling scheme for surveying a 
space $S$ with a prescribed probability measure $\pi$. It has particular importance in 
Bayesian analysis, where $x \in S$ represents a vector of parameters and $\pi(x)$ is the 
posterior distribution of the parameters conditioned on the data. MCMC can 
also be used to solve the so-called missing data problem in frequentist statistics, 
where $x \in S$ represents the value of a latent or unobserved random variable, and 
$\pi(x)$ is its distribution conditioned on the data. In either case, MCMC serves 
as a tool for numerical computation of complex integrals and is often the only 
workable approach for problems involving a large space with a complex structure. 

The numerical integration problem consists of calculating the value of the 
integral $\int f(x)\pi(x)\,dx$. The MCMC approach consists in defining a Markov chain 
$X_t$ with equilibrium distribution $\pi$, and using the theorem that 

$$\int f(x)\pi(x)\,dx \approx \frac{1}{T} \sum_{t=1}^{T} f(X_t) \tag{19.18}$$

for large $T$. In other words, for a large number of samples the value of the integral 
is approximately equal to the average value of $f(X_t)$. 

Probably the most widely used version of MCMC is the Metropolis-Hastings 
method [75]. Metropolis-Hastings algorithms are a class of Markov chains that 
are commonly used to perform large-scale calculations and simulations in physics 
and statistics. Two common problems which are approached by Metropolis- 
Hastings algorithms are simulation and numerical integration. The Metropolis- 
Hastings method gives a solution to the following problem: Construct an ergodic 
Markov chain with states $1, 2, \ldots, N$ and with a prescribed stationary distri- 
bution vector $\pi$. Constructing a Markov chain amounts to defining its state 
transition probabilities. 
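
For a small finite state space the construction can be made fully explicit. The following Python sketch builds the transition matrix of a Metropolis chain (the special case of Metropolis-Hastings with a symmetric proposal) and then forms the ergodic average of (19.18). The assumptions are a uniform proposal over the other states, strictly positive target probabilities, and states indexed from 0 rather than 1; the function names are hypothetical.

```python
import random

def metropolis_matrix(pi):
    """Transition matrix on states 0..N-1 with stationary distribution pi."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j with probability 1/(n-1), accept with min(1, pi_j/pi_i).
                P[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        P[i][i] = 1.0 - sum(P[i])        # rejected proposals keep the chain at i
    return P

def ergodic_average(P, f, steps, seed=0):
    """Average of f along one simulated trajectory of the chain."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    x, total = 0, 0.0
    for _ in range(steps):
        x = rng.choices(states, weights=P[x])[0]
        total += f(x)
    return total / steps

pi = [0.1, 0.2, 0.3, 0.4]
P = metropolis_matrix(pi)
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
estimate = ergodic_average(P, lambda s: s, steps=100000)
```

Detailed balance holds because $\pi_i P_{ij} = \min(\pi_i, \pi_j)/(N-1)$ is symmetric in $i$ and $j$, so `pi_next` reproduces `pi`, and the trajectory average approaches the expectation $\sum_i i\,\pi_i = 2$.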

Simulated annealing can be regarded as MCMC with a varying temperature. 
The approach consists in thinking of the function to be maximized as a proba- 
bility density, and running a Markov chain with this density as the equilibrium 
distribution. At the same time the chain is run, the density is gradually raised 
to higher powers (i.e. lowering the temperature). In the limit as the tempera- 
ture goes to zero, the successive densities turn into a collection of point masses 
located on the set of global maxima of the function. Thus when the temperature 
is very low, the corresponding Markov chain will spend most of its time near the 
global maxima. 
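
A minimal Python sketch of this idea on a one-dimensional integer grid is given below. It treats $\exp(f(x)/T)$ as the unnormalized density at temperature $T$, so lowering $T$ corresponds to raising the density to a higher power. The objective function, the geometric cooling schedule, the nearest-neighbor proposal and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def objective(x):
    """Hypothetical bimodal function: local maximum at 4, global maximum at 15."""
    return 5.0 * math.exp(-(x - 4) ** 2 / 4) + 10.0 * math.exp(-(x - 15) ** 2 / 4)

def simulated_annealing(f, n_states, steps=10000, t0=5.0, cooling=0.999, seed=0):
    """Metropolis chain on 0..n_states-1 whose temperature decays geometrically.
    Returns the best state visited."""
    rng = random.Random(seed)
    x = best = 0
    temperature = t0
    for _ in range(steps):
        y = x + rng.choice((-1, 1))          # propose a neighboring state
        if 0 <= y < n_states:
            delta = f(y) - f(x)
            # Accept uphill moves always, downhill moves with prob. exp(delta / T).
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                x = y
        if f(x) > f(best):
            best = x
        temperature *= cooling
    return best

best_state = simulated_annealing(objective, n_states=21)
```

At high temperature the chain crosses the valley between the two modes easily; as the temperature falls, downhill moves are accepted less and less often and the chain settles near a maximum. Slower cooling makes it more likely that this is the global one.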

The stochastic approximation Monte Carlo algorithm [119] has a self-adjusting 
mechanism. If a proposal is rejected, the weight of the subregion that the current 
sample belongs to is adjusted to a larger value, and thus the proposal of jumping 
out from the current subregion is less likely to be rejected in the next iteration. 
Annealing stochastic approximation Monte Carlo [118] can be regarded as a 
space annealing version of stochastic approximation Monte Carlo. Under mild 
conditions, it can converge weakly at a rate of $O(1/\sqrt{t})$ toward a neighboring 
set (in the space of energy) of the global minimizers. 

Importance sampling is a popular sampling tool used for Monte Carlo com- 
puting. It refers to a collection of Monte Carlo methods where a mathematical 
expectation with respect to a target distribution is approximated by a weighted 
average of random draws from another distribution. In importance sampling, we 
draw random samples $x_i$ from the distribution on the hidden variables $p(X)$, 
and then weight the samples by their likelihood $p(y|x_i)$, where $y$ is the evidence. 
Together with MCMC methods, importance sampling has provided a foundation 
for simulation-based approaches to numerical integration since its introduction 
as a variance reduction technique in statistical physics [73]. 
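
The weighting scheme just described can be sketched as a self-normalized estimator in Python. The model below (a single binary hidden variable with a made-up prior and likelihood) and the function name are hypothetical; the exact posterior is available in closed form here, which makes the approximation easy to check.

```python
import random

def weighted_estimate(sample_prior, likelihood, f, n, seed=0):
    """Estimate E[f(X) | y] from n prior draws weighted by p(y | x)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = sample_prior(rng)      # x_i drawn from p(X)
        w = likelihood(x)          # weight p(y | x_i)
        num += w * f(x)
        den += w
    return num / den

# Illustrative model: p(X = 1) = 0.2, p(y | X = 1) = 0.9, p(y | X = 0) = 0.3.
estimate = weighted_estimate(
    sample_prior=lambda rng: 1 if rng.random() < 0.2 else 0,
    likelihood=lambda x: 0.9 if x == 1 else 0.3,
    f=lambda x: x,
    n=50000,
)
exact = 0.2 * 0.9 / (0.2 * 0.9 + 0.8 * 0.3)   # P(X = 1 | y) by Bayes' theorem
```

The estimate degrades when the evidence is unlikely under the prior, because most draws then receive negligible weight.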

Particle filtering [72], also known as the sequential Monte Carlo method, uses 
point masses, or particles, to represent the probability densities. The tracking 
problem can be expressed as a Bayes filtering problem, in which the posterior 
distribution of the target state is updated recursively as a new observation comes 
in. Sequential importance sampling is a version of the particle filters. All particle 
filters are derived based on the assumptions that the state transition is a first- 
order Markov process and that they use a number of particles to sequentially 
compute the expectation of any function of the state. Particle filtering is a well- 
established Bayesian Monte Carlo technique for estimating the current state of a 
hidden Markov process using a fixed number of samples. Compared to Gaussian- 
based Kalman filters, particle filters are able to represent a much broader class 
of distributions and impose much weaker constraints on the underlying models. 

Reversible-jump MCMC [71] is a framework for the construction of reversible 
Markov chain samplers that jump between parameter subspaces of differing 
dimensionality. It is essentially a random sweep Metropolis-Hastings method. 
This iterative algorithm does not depend on the initial state. At each step, a 
transition from the current state to a new state is accepted with a probabil- 
ity. This acceptance ratio is computed so that the detailed balance condition 
is satisfied, under which the algorithm converges to the measure of interest. 
A characteristic feature of the algorithm is that the proposal kernel can be 
decomposed into several kernels, each corresponding to a reversible move. In 
order to ensure the jump between different dimensions, the various moves used 
are the birth move, death move, split move, merge move, and perturb move, 
each selected with equal probability (0.2) [5]. In [5], simulated annealing with 
the reversible-jump MCMC method is also proposed for optimization, with proven 
convergence. 


Gibbs sampling 


Introduced in [67], Gibbs sampling can be used to approximate any function 
of an initial joint distribution $p(X)$ provided certain conditions are met. Given 
variables $X = \{X_1, \ldots, X_K\}$ with some joint distribution $p(X)$, we can use a 
Gibbs sampler to approximate the expectation of a function $f(X)$ with respect 
to $p(X)$. The Gibbs sampler is an iterative adaptive scheme. First, the Gibbs 
sampler proceeds by generating a value from the conditional distribution of each 
component of $X$, given the values of all other components of $X$. 

The Gibbs sampling implements iterations for a vector £ = (£1, ¥2,...,UK) at 
each time t as: 21(t) is drawn from the distribution of X1, given x2(t — 1), x3(t — 
1),...,@«K(t — 1), followed by implementations for x;, i = 2,..., K. Each com- 
ponent of the random vector is visited in the natural order. The new value of 
component x;, i = 1,2,..., K, i Æ k is immediately used for drawing a new value 
of Xf. 

Then, we sample a state for $X_i$ based on this probability distribution, and
compute $f(X)$. The two steps are iterated, keeping track of the average value of
$f(X)$. In the limit, as the number of cases approaches infinity, this average is
equal to $E_{p(X)}(f(X))$ provided two conditions are met. First, the Gibbs sampler
must be irreducible: $p(X)$ must be such that we can eventually sample any possible
configuration of $X$ given any possible initial configuration of $X$. Second, each $X_i$
must be chosen infinitely often. In practice, an algorithm for deterministically 
rotating through the variables is typically used. Gibbs sampling is a special 
case of the general MCMC method for approximate inference [137]. Like the 
Metropolis-Hastings algorithm, the Gibbs sampler generates a Markov chain with 
the Gibbs distribution as the equilibrium distribution. However, the transition 
probabilities associated with the Gibbs sampler are nonstationary [67]. 

Under mild conditions, the following three theorems hold for Gibbs sampling 
[67], [65]. 


Theorem 19.2 (Convergence theorem). The random variable $X_k(n)$ converges
in distribution to the true probability distribution of $X_k$, for
$k = 1, 2, \ldots, K$, as $n$ approaches infinity.

Specifically, convergence of Gibbs sampling still holds under an arbitrary
visiting scheme provided that this scheme does not depend on the values of the
variables and that each component of $X$ is visited on an infinitely-often basis
[67].


Theorem 19.3 (Convergence of joint cumulative distribution). Assuming
that the components of $X$ are visited in the natural order, the joint cumulative
distribution of the random variables $X_1(n), X_2(n), \ldots, X_K(n)$ converges to the
true joint cumulative distribution of $X_1, X_2, \ldots, X_K$ at a geometric rate in $n$.


Theorem 19.4 (Ergodic theorem). For any measurable function $g$ of the
random variables $X_1, X_2, \ldots, X_K$ whose expectation exists, we have

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} g(X_1(i), X_2(i), \ldots, X_K(i)) \to E[g(X_1, X_2, \ldots, X_K)],
\]

with probability 1 (i.e., almost surely).


The ergodic theorem tells us how to use the output of the Gibbs sampler to 
obtain numerical estimations of the desired marginal densities. 
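The sampling scheme and the ergodic average can be illustrated with a minimal sketch. The target here is an assumption chosen for illustration, not an example from the text: a zero-mean, unit-variance bivariate Gaussian with correlation `rho`, whose full conditionals are known in closed form. The function name `gibbs_bivariate_gaussian` and the burn-in length are likewise hypothetical.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian
    with correlation rho (illustrative target, assumed for this sketch)."""
    rng = np.random.default_rng(seed)
    cond_std = np.sqrt(1.0 - rho ** 2)   # std of each full conditional
    x1, x2 = 0.0, 0.0                    # arbitrary initial configuration
    samples = np.empty((n_samples, 2))
    for n in range(burn_in + n_samples):
        # Components are visited in the natural order; the new x1 is
        # used immediately when drawing x2.
        x1 = rng.normal(rho * x2, cond_std)   # X1 | X2 = x2
        x2 = rng.normal(rho * x1, cond_std)   # X2 | X1 = x1
        if n >= burn_in:
            samples[n - burn_in] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8, n_samples=20000)
# Ergodic average of g(X1, X2) = X1 * X2, which approximates E[X1 X2] = rho.
estimate = np.mean(samples[:, 0] * samples[:, 1])
print(f"ergodic estimate of E[X1 X2]: {estimate:.3f}")
```

Successive samples are correlated, so the average converges more slowly than it would for independent draws; discarding an initial burn-in segment is a common heuristic rather than a requirement of the theorems.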

Gibbs sampling is used in the Boltzmann machine to sample from the distribution
over hidden neurons. In the context of a stochastic machine using binary units 
(e.g., the Boltzmann machine), it is noteworthy that the Gibbs sampler is exactly 
the same as a variant of the Metropolis-Hastings algorithm. 


Variational Bayesian methods 


The mean-field methods such as variational Bayes, loopy belief propagation, 
expectation propagation and expectation consistent have aroused much interest 
because of their potential as approximate Bayesian inference engines [95], [143], 
[144], [210], [132], [145]. They represent a very attractive compromise between 
accuracy and computational complexity. The mean-field approximation exploits 
the law of large numbers to approximate large sums of random variables by 
their means. In many instances, the typical polynomial complexity makes the 
mean-field methods the only available Bayesian inference option. An undesirable 
property of the mean-field methods is that the approximation error cannot be
determined. A message-passing interpretation of the mean-field approximation is given
in [199]. 

Expectation consistent and expectation propagation are closely related to the 
adaptive Thouless-Anderson-Palmer mean-field theory framework [143], [144]. 
Expectation consistent is a formalization of the same underlying idea aiming 
at giving an approximation to the marginal likelihood. A set of complementary 
variational distributions, which share sufficient statistics, arise naturally in the 
framework. Expectation propagation is an intuitively appealing scheme for tuning
the variational distributions to achieve this expectation consistency.

Variational Bayesian methods, also known as ensemble learning, provide an
approach for the design of approximate inference algorithms [95], [192], [133],
[85]. Variational Bayesian learning is an approximation to Bayesian learning.
These methods are very flexible general modelling tools. Variational Bayesian is
developed to approximate the posterior density, with the objective of minimizing
the misfit between the approximation and the true posterior [193]. It can be used
for both parametric and variational approximation. In the parametric
approximation, some parameters for the posterior pdf are optimized, while in
variational approximation a whole function is optimized. Variational Bayesian
learning can avoid overfitting, which is difficult to avoid if ML or MAP
estimation is used.

Variational Bayesian treatments of statistical models present significant advan- 
tages over ML-based alternatives: ML approaches have the undesirable property 
of being ill-posed since the likelihood function is unbounded from above [206]. 
The adoption of a Bayesian model inference algorithm, providing posterior dis- 
tributions over the model parameters instead of point-estimates, would allow for 
the natural resolution of these issues [206]. Another central issue that ML treat- 
ments of generative models are confronted with is the selection of the optimal 
model size. 

Variational Bayesian is an EM-like iterative procedure, and its objective (a lower
bound on the marginal likelihood) is guaranteed to increase monotonically at each
iteration. It alternately performs an E-step and
an M-step. It requires only a modest amount of computational time, compared 
to that of EM. The BIC and MDL criteria for model selection are obtained 
from the variational Bayesian method in a large sample limit [8]. In this limit, 
variational Bayesian is equivalent to EM. Model selection in variational Bayesian 
is automatically accomplished by maximizing an estimation function. Variational 
Bayesian often shows a better generalization ability than EM when the number of 
data points is small. The online variational Bayesian algorithm [165] has proved 
convergence. It is a gradient method with the inverse of the Fisher information 
matrix for the posterior parameter distribution as a coefficient matrix [165], that 
is, the variational Bayesian method is a type of natural-gradient method. 

The stochastic complexity in variational Bayesian learning, which is an impor- 
tant quantity for selecting models, is also called the variational free energy and 
corresponds to a lower bound for the marginal likelihood or the Bayesian evidence 
[165]. Variational Bayesian learning of Gaussian mixture models is discussed 
in [195] and upper and lower bounds of variational stochastic complexities are 
derived. Variational Bayesian learning of mixture models of exponential families 
that include mixtures of distributions such as Gaussian, binomial and gamma is 
treated in [196]. A variational Bayesian algorithm for Student-t mixture models 
[6] is useful for constructing robust mixture models. 
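The lower-bound property mentioned above can be made explicit with a standard decomposition. The notation is an assumption introduced here: $D$ denotes the data, $\theta$ collects all unobserved quantities (parameters and latent variables), and $q(\theta)$ is the variational distribution.

\[
\log p(D) = \underbrace{\int q(\theta) \log \frac{p(D, \theta)}{q(\theta)} \, d\theta}_{\mathcal{L}(q)} + \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta \mid D)\bigr) \geq \mathcal{L}(q).
\]

Since the Kullback-Leibler term is nonnegative, $\mathcal{L}(q)$ is a lower bound on the log evidence, and maximizing it over $q$ is the same as minimizing the misfit between $q(\theta)$ and the true posterior. Under the common sign convention, the variational free energy is $-\mathcal{L}(q)$, which is then an upper bound on $-\log p(D)$.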

Figure 19.6 DAG of a first-order HMM. $q_i$ and $o_i$, $i = 1, \ldots, N$, are the hidden state variables and the
observable variables, respectively.


19.7 Hidden Markov models 


HMMs are probabilistic finite-state machines used to find structures in sequential 
data. An HMM is defined by a set of states, the transition probabilities between 
states, and a table of emission probabilities associated with each state for all 
possible symbols that occur in the sequence. This allows domain information to 
be built into its structure while allowing fine details to be learned from the data 
by adjusting the transition and emission probabilities. At each step, the system 
transits to another state and emits an observable quantity according to a state-specific
probability distribution. HMMs are very successful for speech recognition and 
gene prediction. 


Definition 19.5. An HMM is defined by:
1) A set of $N$ possible states $Q = \{q_1, q_2, \ldots, q_N\}$.
2) A state transition matrix $A = [a_{ij}]$, where $a_{ij} = P(x_t = j \mid x_{t-1} = i)$,
$i, j = 1, \ldots, N$, denotes the probability of making a transition from state $q_i$
to $q_j$, $\sum_{j=1}^{N} a_{ij} = 1, \forall i$, and $x_t$ denotes the state at time $t$.
3) A prior distribution over the state of the system at an initial time.
4) A set of $M$ possible outputs $O = \{o_1, o_2, \ldots, o_M\}$.
5) A state-conditioned probability distribution over observations $B = [b_{jk}]$, where
$b_{jk} = P(y_t = k \mid x_t = j) \geq 0$, $j = 1, \ldots, N$, $k = 1, \ldots, M$, denotes the
observation probability for state $q_j$ over observations $o_k$,
$\sum_{k=1}^{M} b_{jk} = 1, \forall j$, and $y_t$ denotes the output at time $t$.


The first-order HMM is a simple probability model, as shown in Fig. 19.6. There
exist efficient algorithms (O(N)) for solving the inference and MAP problems. 
An HMM has one discrete hidden node and one discrete or continuous observed 
node per slice. If the Bayesian network is acyclic, one can use a local message- 
passing algorithm, which is a generalization of the forward-backward algorithm 
for HMMs. 

Denote a fixed-length observation sequence by $y = (y_1, y_2, \ldots, y_n)$ and the
corresponding state sequence by $x = (x_1, x_2, \ldots, x_n)$. An HMM defines a joint
probability distribution over observations as

\[
\begin{aligned}
P(y) &= \sum_{x} P(x) P(y \mid x) \\
&= \sum_{x} P(x_1) P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1}) P(y_1 \mid x_1) P(y_2 \mid x_2) \cdots P(y_n \mid x_n) \\
&= \sum_{x} P(x_1) P(y_1 \mid x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1}) P(y_i \mid x_i), \qquad (19.19)
\end{aligned}
\]

where $P(x_i \mid x_{i-1})$ is specified by the state transition matrix, and $P(y_i \mid x_i)$ is
specified by the model.
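The sum in (19.19) runs over all $N^n$ state sequences, but it can be evaluated in $O(N^2 n)$ time by the forward recursion. The following is a minimal sketch that checks the recursion against direct enumeration on a toy model. The parameter values, the 0-based indexing of states and symbols, and the function names are assumptions made for illustration.

```python
import itertools
import numpy as np

def forward_likelihood(pi, A, B, y):
    """P(y) by the forward recursion, O(N^2 n) time."""
    alpha = pi * B[:, y[0]]                 # alpha_1(j) = P(x_1 = j, y_1)
    for obs in y[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(y_t)
        alpha = (alpha @ A) * B[:, obs]
    return alpha.sum()

def brute_force_likelihood(pi, A, B, y):
    """P(y) by summing the joint over all N^n state sequences, as in (19.19)."""
    N, n = len(pi), len(y)
    total = 0.0
    for x in itertools.product(range(N), repeat=n):
        p = pi[x[0]] * B[x[0], y[0]]
        for i in range(1, n):
            p *= A[x[i - 1], x[i]] * B[x[i], y[i]]
        total += p
    return total

# Toy HMM with N = 2 states and M = 3 symbols (hypothetical values).
pi = np.array([0.6, 0.4])                        # prior over the initial state
A = np.array([[0.7, 0.3], [0.4, 0.6]])           # A[i, j] = P(x_t = j | x_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]) # B[j, k] = P(y_t = k | x_t = j)
y = [0, 1, 2, 2]

p_forward = forward_likelihood(pi, A, B, y)
p_brute = brute_force_likelihood(pi, A, B, y)
print(p_forward, p_brute)
```

For long sequences the unscaled recursion underflows, so practical implementations normalize `alpha` at each step and accumulate the logarithms of the normalizers.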

The standard method of estimating the parameters of HMM is the EM algorithm
[153], [52]. Learning consists of adjusting the model parameters to maximize
the likelihood of a given training sequence $y_{0 \to T}$. The EM procedure is
guaranteed to converge to a local ML estimate for the model parameters under 
very general conditions. Unfortunately, EM is strongly dependent on the selection 
of the initial values of the model parameters. The simulated annealing version 
of EM [90] combines simulated annealing with EM by reformulating the HMM 
estimation process using a stochastic step between the EM steps and simulated 
annealing. In contrast to HMMs, Kalman filters are relevant when the hidden 
states are described by continuous variables. An online EM algorithm for HMMs, 
which does not require the storage of the inputs by rewriting the EM update in 
terms of sufficient statistics updated recursively, is given in [134]. This scheme 
is generalized to the case where the model parameters can change with time by 
introducing a discount factor into the recurrence relations. The resulting
algorithm is equivalent to the batch EM algorithm, for an appropriate discount
factor and scheduling of parameter updates [134].

For HMMs, the EM algorithm results in reestimating the model parameters 
according to the Baum-Welch formulas: 


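\[
\hat{a}_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T} P(x_{t-1} = i, x_t = j \mid y_{0 \to T}, \hat{\theta}_n)}{\sum_{t=1}^{T} P(x_{t-1} = i \mid y_{0 \to T}, \hat{\theta}_n)}, \qquad (19.20)
\]

\[
\hat{b}_{jk}^{(n+1)} = \frac{\sum_{t=0}^{T} P(x_t = j, y_t = k \mid y_{0 \to T}, \hat{\theta}_n)}{\sum_{t=0}^{T} P(x_t = j \mid y_{0 \to T}, \hat{\theta}_n)}, \qquad (19.21)
\]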


where the probabilities on the right-hand side are conditioned on the training
sequence $y_{0 \to T}$ and on the current parameter estimate
$\hat{\theta}_n = (\hat{a}_{ij}, \hat{b}_{jk})$. They can be efficiently computed
using the forward-backward procedure. This, however, requires storing the whole
training sequence. Baum-Welch is an iterative EM technique specialized for batch
learning of HMM parameters via ML estimation to best fit the observed data, used
to estimate the transition matrix $A$ of an HMM. Traditionally, the Baum-Welch
algorithm is used to infer the state transition matrix of a Markov chain and
symbol output probabilities associated with the states of the chain, given an
initial Markov model and a sequence of symbolic output values [153].

The forward-backward algorithm [28], [13] is a dynamic programming tech- 
nique that forms the basis for estimation of HMM parameters using the Baum- 
Welch technique. Given a finite sequence of training data, it efficiently evaluates 
the likelihood of this data given an HMM, and computes the smoothed condi- 
tional state probability densities for updating HMM parameters according to the 
Baum-Welch algorithm. 
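A minimal sketch of one Baum-Welch reestimation step built on a scaled forward-backward pass is given below. It is an illustration under stated assumptions rather than the procedure of any cited reference: observations are 0-based symbol indices, per-step normalization is used to avoid underflow, the toy sequence is random, and the names `forward_backward` and `baum_welch_step` are hypothetical.

```python
import numpy as np

def forward_backward(pi, A, B, y):
    """Scaled forward-backward pass.
    Returns gamma[t, i] = P(x_t = i | y), xi[t, i, j] = P(x_t = i, x_{t+1} = j | y),
    and the log-likelihood log P(y)."""
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    c = np.zeros(T)                          # per-step normalizers
    alpha[0] = pi * B[:, y[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                    # forward (filtering) pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = (A @ (B[:, y[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                     # smoothed state posteriors
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, y[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]
    return gamma, xi, np.log(c).sum()

def baum_welch_step(pi, A, B, y):
    """One EM update of (pi, A, B), in the spirit of (19.20) and (19.21)."""
    y = np.asarray(y)
    gamma, xi, loglik = forward_backward(pi, A, B, y)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[y == k].sum(axis=0)   # expected counts of symbol k
    B_new /= gamma.sum(axis=0)[:, None]
    return gamma[0], A_new, B_new, loglik    # loglik is for the input parameters

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)             # toy observation sequence
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
logliks = []
for _ in range(20):
    pi, A, B, ll = baum_welch_step(pi, A, B, y)
    logliks.append(ll)
print(logliks[0], logliks[-1])
```

The recorded log-likelihoods should be nondecreasing across iterations, which is a convenient sanity check for any EM implementation. Note that the arrays `alpha` and `beta` require O(NT) memory, which is the cost discussed in the following paragraph.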

Despite suffering from numerical instability, the forward-backward algorithm 
is more famous than the forward filtering backward smoothing algorithm [57]. 
The latter algorithm [147], [154] is probabilistically more meaningful since it 
propagates probability densities in its forward and backward passes. An efficient
version of the algorithm [99] reduces the memory complexity without the
computational overhead. Efficient forward filtering backward smoothing generates
the same results with the same time complexity $O(N^2 T)$ for an HMM with $N$ states
and an observation sequence of length $T$, while reducing the memory complexity
from $O(NT)$ for forward-backward to $O(N)$.

The Student-t HMM is a robust form of conventional continuous density 
HMMs, trained by means of the EM algorithm. A variational Bayesian inference 
algorithm [30] for this model yields the variational Bayesian Student-t HMM. 
The approach provides an efficient and more robust alternative to EM-based 
methods, tackling their singularity and overfitting proneness, while allowing for 
the automatic determination of the optimal model size without crossvalidation. 

The Viterbi algorithm is a dynamic programming method that uses the
Viterbi path to discover the single most likely explanation for the observations.
The evidence propagation algorithm [111], [93] is the inference algorithm for 
directed probabilistic independent networks. A closely related algorithm [50] 
solves the MAP identification problem with the same time complexity as the 
evidence propagation inference algorithm. The two algorithms are strict general- 
izations of the forward-backward and Viterbi algorithms for HMM, respectively 
[172]. 
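For the HMM case, the Viterbi recursion can be sketched as follows. This is a minimal log-domain version that assumes all entries of the prior, transition, and emission tables are strictly positive; the toy parameter values and the helper `joint_log_prob` are hypothetical and serve only to check the result.

```python
import numpy as np

def viterbi(pi, A, B, y):
    """Most likely state sequence for observations y (log domain).
    Returns the path and its joint log-probability log P(x, y)."""
    T, N = len(y), len(pi)
    log_A = np.log(A)
    log_delta = np.log(pi) + np.log(B[:, y[0]])
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + log_A        # scores[i, j]: come from i, go to j
        backptr[t] = scores.argmax(axis=0)         # best predecessor of each state j
        log_delta = scores.max(axis=0) + np.log(B[:, y[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = log_delta.argmax()
    for t in range(T - 1, 0, -1):                  # backtrack along the pointers
        path[t - 1] = backptr[t, path[t]]
    return path, log_delta.max()

def joint_log_prob(pi, A, B, x, y):
    """log P(x, y) for a given state sequence x."""
    lp = np.log(pi[x[0]]) + np.log(B[x[0], y[0]])
    for t in range(1, len(y)):
        lp += np.log(A[x[t - 1], x[t]]) + np.log(B[x[t], y[t]])
    return lp

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
y = [0, 1, 2, 2]

path, best_lp = viterbi(pi, A, B, y)
print(path, best_lp)
```

The structure mirrors the forward recursion, with the sum over predecessor states replaced by a maximum, which is why both algorithms share the same time complexity.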

Two common categories of algorithms for learning Markov network structure 
from data are score-based [152] and independence- or constraint-based [174], [22] 
algorithms. An efficient method for incrementally inducing features of Markov 
random fields is suggested in [152]. However, evaluation of these scores has been 
proved to be NP-hard for undirected models. A class of efficient algorithms for
structure and parameter learning of factor graphs that subsume Markov and
Bayesian networks is introduced in [1].

GSMN (grow-shrink Markov network) and GSIMN (grow-shrink inference 
Markov network) are two independence-based structure learning algorithms of 
Markov networks [22]. GSMN is an adaptation to Markov networks of the grow- 
shrink algorithm for learning the structure of Bayesian networks [129]. GSIMN 
is nearly optimal in terms of the number of tests it can infer, under a fixed order- 
ing of the tests performed. Dynamic GSIMN learning [63] improves on GSIMN 
by dynamically selecting the locally optimal test that will increase the state of
knowledge about the structure the most. Both GSIMN and dynamic GSIMN 
extend and improve GSMN by additionally exploiting Pearl’s theorems on the 
properties of conditional independence relation to infer additional dependencies 
and independences from the set of already known ones resulting from statistical 
tests and previous inferences, thus avoiding the execution of these tests on data 
and therefore speeding up the structure learning process. 


Dynamic Bayesian networks 


Dynamic Bayesian networks [101] are standard extensions of Bayesian networks 
to temporal processes. They generalize HMMs and Kalman filters. Dynamic 
Bayesian networks model a dynamic system by discretizing time and providing 
a Bayesian network fragment that represents the probabilistic transition of the 
state at time t to the state at time t + 1. The state of the system is represented at 
different points in time implicitly. It is very difficult to query a dynamic Bayesian 
network for a distribution over the time at which a particular event takes place. 
Moreover, since dynamic Bayesian networks slice time into fixed increments, one 
must always propagate the joint distribution over the variables at the same rate. 

Dynamic Bayesian networks generalize HMMs by representing the hidden state 
(and observed states) in terms of state variables related in a DAG. This effec- 
tively reduces the number of parameters to be specified. A dynamic Bayesian 
network can be converted into an HMM. In a dynamic Bayesian network case, a 
probabilistic network models a system as it evolves over time. Dynamic Bayesian 
networks are also time-invariant since the topology of the network is a repeat- 
ing structure and the CPTs do not change over time. Continuous-time Markov 
networks are the undirected counterparts of continuous-time Bayesian networks. 

Since dynamic Bayesian networks are only a subclass of Bayesian networks, 
the structure-based algorithms developed for Bayesian networks can be imme- 
diately applied to reasoning with dynamic Bayesian networks. Constant-space 
algorithms for dynamic Bayesian networks [48] are efficient algorithms whose 
space complexity is independent of the time span T. 

Continuous-time Bayesian networks [140] are based on the framework of con- 
tinuous time, finite state, homogeneous Markov processes. Exact inference in a 
continuous-time Bayesian network can be performed by generating a single joint 
intensity matrix over the entire state space of the continuous-time Bayesian net- 
work and running the forward-backward algorithm on the joint intensity matrix 
of the homogeneous Markov process. Inference in such models is intractable even 
in relatively simple structured networks. In a mean field variational approxima- 
tion [41], a product of inhomogeneous Markov processes is used to approximate 
a joint distribution over trajectories. Additionally, it provides a lower bound on 
the probability of observations, thus making it attractive for learning tasks. 

Importance sampling based approximate inference [59] does not require
computing the exact posterior distribution. It is extended to continuous-time
particle filtering and smoothing algorithms. These three algorithms can estimate
the expectation of any function of a trajectory, conditioned on any evidence set 
constraining the values of subsets of the variables over subsets of the time line. 
Compared to approximate inference algorithms based on expectation propaga- 
tion and Gibbs sampling, the importance sampling algorithm outperforms both 
in most of the experiments. In the situation of a highly deterministic system, 
Gibbs sampling performs better. 

The continuous-time Bayesian network reasoning and learning engine (CTBN- 
RLE) [167] software provides C++ libraries and programs for most of the algo- 
rithms developed for continuous-time Bayesian networks. Exact inference as well 
as approximate inference methods including Gibbs sampling [54] and importance 
sampling [59] are implemented. In an MCMC procedure [54], a Gibbs sampler 
is used to generate samples from the posterior distribution given the evidence. 
The Gibbs sampling algorithm can handle any type of evidence. 


Expectation-maximization algorithm 


The EM method [52] is the most popular optimization approach to the exact
ML solution for the parameters given incomplete data. EM splits a complex
learning problem into a group of separate small-scale subproblems and solves
each of the subproblems using a simple method, and is thus computationally
efficient. It is a technique for finding a local ML or MAP estimate. Even though EM is
generally considered to converge linearly, it has a significantly faster convergence 
rate than gradient descent. EM can be viewed as a deterministic version of Gibbs 
sampling, and can be used to search for the MAP estimate of model parameters. 
A message-passing interpretation for EM is available in [49]. 

For a large amount of data, $p(\theta_s \mid D, S^h) \propto p(D \mid \theta_s, S^h) \cdot p(\theta_s \mid S^h)$ can be
approximated as a multivariate-Gaussian distribution [77]. Define

\[
\tilde{\theta}_s = \arg\max_{\theta_s} \log\{p(D \mid \theta_s, S^h) \cdot p(\theta_s \mid S^h)\}. \qquad (19.22)
\]

Thus, $\tilde{\theta}_s$ also maximizes $p(\theta_s \mid D, S^h)$, and the solution is known as the MAP
solution of $\theta_s$.

As the sample size increases, the effect of the prior $p(\theta_s \mid S^h)$ diminishes, thus
MAP reduces to the ML configuration

\[
\hat{\theta}_s = \arg\max_{\theta_s} \{p(D \mid \theta_s, S^h)\}. \qquad (19.23)
\]

The explicit determination of ML estimates for the conditional probabilities of
the nodes of a discrete Bayesian network using incomplete data is usually not
possible, and iterative methods such as EM or Gibbs sampling are normally
required.

The EM method alternates between performing the expectation step (E-step)
and the maximization step (M-step), which, respectively, compute the expected
values of the latent variables and the ML or MAP estimates of the parameters.
$\theta_s$ is initialized at random. In the E-step, the expected sufficient statistics for
the missing entries of the database $D$ are computed conditioned on the assigned
configuration of $\theta_s$ and the known data $D$. In the M-step, the expected
sufficient statistics are taken as though they were the actual sufficient statistics of
a database $D'$, and the mean of the parameters $\theta_s$ is calculated such that the
probability of observation of $D'$ is maximized.

Under certain regularity conditions, iteration of the E- and M-steps will con- 
verge to a local maximum [52]. The EM algorithm is typically applied when 
sufficient statistics exist, i.e., when local distribution functions are in the
exponential family. It is fast, but it does not provide a distribution over the
parameters $\theta$. A switch can be made to alternative algorithms when near a solution in order
to overcome the slow convergence of EM when near local maxima [131]. 

The asymptotic convergence rate of the EM algorithm for Gaussian mixtures 
locally around the true solution is given in [127]. The large sample local conver- 
gence rate for the EM algorithm tends to be asymptotically superlinear when 
the measure of the average overlap of Gaussians in the mixture tends to zero. 

Singularities in the parameter spaces of hierarchical learning machines are 
known to be a main cause of slow convergence of gradient-descent learning. 
EM is a good alternative to overcome the slow learning speed of the gradient- 
descent method. The slow convergence of the EM method in the case of large 
component overlap is a widely known phenomenon. The dynamics of EM for 
Gaussian mixtures around singularities is analyzed in [148]; there exists a slow 
manifold caused by a singular structure [148], which is closely related to the slow 
convergence of the EM algorithm. In the case of the mixture of densities from 
exponential families, the convergence speed depends on the separation of the 
component populations in the mixture. When the component populations are 
poorly separated, the convergence speed of EM becomes extraordinarily slow. 

The noisy EM theorem states that a suitably noisy EM algorithm estimates the 
EM estimate in fewer steps on average than does the corresponding noiseless EM 
algorithm [146]. Many centroid-based clustering algorithms, including C-means,
benefit from noise because they are special cases of the EM algorithm.

Another general iterative algorithm for learning parameters of statistical mod- 
els is a Gibbs sampler called data augmentation [181]. Data augmentation is 
quite similar to EM, but instead of calculating the expected values of the suf- 
ficient statistics, a value is drawn from a predictive distribution and imputed 
(I-step). Similarly, instead of calculating the ML-estimates, a parameter value 
is drawn from the posterior distribution on the parameter space conditioned on 
the most recent fully imputed data sample (P-step). Based on MCMC theory 
this will in the limit return parameter realizations from the posterior parameter 
distribution conditioned on the observed data. 

The EM method has been applied to obtain ML estimates of the network
parameters for feedforward network learning. In [126], the training of the three- 
layer MLP is first decomposed into a set of single neurons, and the individual 
neurons are then trained via a linearly weighted regression algorithm. The EM 
method has also been applied for RBF network learning [112, 107]. For classifi- 
cation using the RBF network [207], the Bayesian method is applied to explore 
the weight structure of RBF networks, and an EM algorithm is used to estimate 
the weights. 


Mixture models 


Mixture models provide a rigorous framework for density estimation and clus- 
tering. Mixtures of Gaussians are widely used throughout the fields of machine 
learning and statistics for data modeling. 

The mixture model with $K$ components is given by

\[
p(x, v, \theta) = \sum_{i=1}^{K} v_i \psi(x, \theta_i), \qquad (19.24)
\]

where $v = (v_1, v_2, \ldots, v_K)^T$ with $v_i$ being a mixing coefficient for the ith
component, $\theta = (\theta_1, \theta_2, \ldots, \theta_K)^T$ with $\theta_i$ being a parameter associated with the ith
component, and $\psi(\cdot)$ is a basic pdf.

A latent variable model seeks to relate an n-dimensional measured vector $y$ to
a corresponding m-dimensional vector of latent variables $x$ [187]:

\[
y = f(x, w) + n, \qquad (19.25)
\]


where f is a function of the latent variables x and parameters w, and n is a noise 
process. Generally, one selects m < n in order to ensure that the latent variables 
offer a more parsimonious description of the data. The model parameters can be 
determined by the ML method. 

The task of estimating the parameters of a given mixture can be achieved 
with different approaches: ML, MAP or Bayesian inference. ML estimates the 
parameters by means of the maximization of a likelihood function. One of the 
most common methods for fitting mixtures to data is EM. However, EM is unable
to perform model selection, i.e., to determine the appropriate number of model
components in a density mixture. The same is true for the MAP estimation
approach, which tries to find the parameters that correspond to the maximum of
the posterior density function.

Competitive EM [213], analogous to the SMEM algorithm [190], utilizes a 
heuristic split-and-merge mechanism to either split the model components in 
an underpopulated region or merge the components in an overpopulated region 
iteratively so as to avoid local solutions. It also exploits a heuristic component 
annihilation mechanism to determine the number of model components. In [15], 
the entropy of the pdf associated with each kernel is used to measure the quality of
a given mixture model with a fixed number of kernels. Entropy-based EM starts
with a single kernel and performs only splitting by selecting the worst kernel
according to a global mixture entropy-based criterion called Gaussianity
deficiency in order to find the optimum number of components of the mixture.

Another way is to implement parameter estimation and model selection jointly 
in a single paradigm by using MCMC or variational methods [69], [136]. Varia- 
tional algorithms are guaranteed to provide a lower bound of the approximation 
error [69]. Reversible jump MCMC has been applied to the Gaussian mixture 
model [155]. A Bayesian method for mixture model training [42] simultaneously 
treats feature selection and model selection. The algorithm follows the variational 
framework and can simultaneously optimize over the number of components, the 
saliency of the features, and the parameters of the mixture model. 

The Gaussian mixture model can be thought of as a prototype method, similar 
in spirit to C-means and LVQ. Each cluster is described in terms of a Gaussian 
density, which has a centroid (as in C-means) and a covariance matrix. The 
two alternating steps of the EM algorithm are very similar to the two steps in 
C-means. The Gaussian mixture model is often referred to as a soft-clustering 
method, while C-means is hard. 
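The two alternating steps and the soft-clustering interpretation can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under assumptions not taken from the text: the synthetic data, the quantile-based initialization, the fixed iteration count, and the function name `em_gmm_1d` are all hypothetical choices.

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=50):
    """EM for a 1-D Gaussian mixture with K components.
    Returns mixing coefficients, means, variances, responsibilities,
    and the log-likelihood recorded at each iteration."""
    v = np.full(K, 1.0 / K)                             # mixing coefficients
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)       # spread-out initial means
    var = np.full(K, x.var())
    logliks = []
    for _ in range(n_iter):
        # E-step: responsibilities r[n, i] = P(component i | x_n).
        dens = v * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        mix = dens.sum(axis=1)
        logliks.append(np.log(mix).sum())
        r = dens / mix[:, None]
        # M-step: weighted ML updates of the parameters.
        Nk = r.sum(axis=0)
        v = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return v, mu, var, r, np.array(logliks)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
v, mu, var, r, logliks = em_gmm_1d(x, K=2)

soft = r[:5]                      # soft memberships: each row sums to 1
hard = r.argmax(axis=1)           # hard labels, as a C-means-style assignment
print(v, mu, var)
```

Replacing the responsibilities by their one-hot `argmax` and fixing equal spherical variances turns the update of the means into the C-means centroid update, which is one way to see the relationship described above.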

When the data are not Gaussian, mixtures of generalized Dirichlet distribu- 
tions may be adopted as a good alternative [18]. In fact, the generalized Dirichlet 
distribution is more appropriate for modeling data that are compactly supported, 
such as data originating from videos, images, or text. The conditional indepen- 
dence assumption among features commonly used in modeling high-dimensional 
data becomes a fact for generalized Dirichlet data sets without loss of accuracy. 
In [19], an unsupervised approach is presented for extraction of independent and 
non-Gaussian features in mixtures of generalized Dirichlet distributions. The 
proposed model is learned using the EM algorithm by minimizing the message 
length of the data set. 


Probabilistic PCA 


Probabilistic model-based PCA methods consider the linear generative model, 
and derive iterative algorithms by using EM within an ML framework for Gaussian
data. These are batch algorithms that find the principal subspace. A probabilistic
formulation of PCA provides a good foundation for handling missing values.
However, these methods conduct PSA rather than PCA. Two examples of prob- 
abilistic model-based PCA methods are probabilistic PCA [186] and EM-PCA 
[159]. 

The starting point for probabilistic PCA [187] is a factor analysis style latent 
variable model (19.25). The EM algorithm for PCA described in [159] is
computationally very efficient in space and time; it does not require computing the
sample covariance and has a complexity limited by $O(knp)$ operations for $k$
leading eigenvectors to be learned, $p$ dimensions and $n$ data points. EM-ePCA [3] finds
exact principal directions without rotational ambiguity by the minimization of
an integrated-squared error measure for derivation of PCA. In addition, GHA 
can be derived using the gradient-descent method by minimizing the integrated- 
squared error. A number of Bayesian formulations of PCA have followed from the 
probabilistic formulation of PCA [186], with the necessary marginalization being 
approximated through both Laplace approximations and variational bounds. 
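A minimal sketch of an EM iteration for PCA in the zero-noise limit is given below. It uses the commonly stated alternating updates, an E-step $X = (W^T W)^{-1} W^T Y$ and an M-step $W = Y X^T (X X^T)^{-1}$, and should be read as an illustration under that assumption rather than as the exact algorithm of [159]. The synthetic data, the iteration count, and the name `em_pca` are hypothetical.

```python
import numpy as np

def em_pca(Y, k, n_iter=200, seed=0):
    """EM iteration for the k-dimensional principal subspace.
    Y is a p x n matrix whose rows have been centered.
    Returns a p x k orthonormal basis of the estimated subspace."""
    rng = np.random.default_rng(seed)
    p = Y.shape[0]
    W = rng.standard_normal((p, k))                # random initial loading matrix
    for _ in range(n_iter):
        # E-step: latent coordinates X (k x n) given W; cost O(knp).
        X = np.linalg.solve(W.T @ W, W.T @ Y)
        # M-step: W (p x k) given X, i.e., W = Y X^T (X X^T)^{-1}.
        W = np.linalg.solve(X @ X.T, X @ Y.T).T
    Q, _ = np.linalg.qr(W)                         # orthonormalize the basis
    return Q

rng = np.random.default_rng(1)
p, n, k = 5, 500, 2
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2])       # well-separated variances
R, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal mixing
Y = R @ (scales[:, None] * rng.standard_normal((p, n)))
Y -= Y.mean(axis=1, keepdims=True)

Q = em_pca(Y, k)
U = np.linalg.svd(Y, full_matrices=False)[0][:, :k]
# Compare subspaces through their projectors, since the bases may differ by a rotation.
gap = np.linalg.norm(Q @ Q.T - U @ U.T)
print(gap)
```

The sample covariance is never formed, which is the source of the space savings. The comparison is made between projectors because the iteration recovers the principal subspace only up to a rotation within it, consistent with the remark that these methods conduct PSA rather than PCA.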

The probabilistic latent variable model of [162] performs linear dimension 
reduction on data sets which contain clusters. The ML solution of the model 
is an unsupervised generalization of LDA. The proposed solution is a generative 
model which is made up of two orthogonal latent variable models: One is an 
unconstrained, probabilistic PCA-like model, while the other has a mixture of 
Gaussians prior on the latent variables. 

Dual probabilistic PCA [113] is a probabilistic interpretation of PCA. It has 
an additional advantage that the linear mappings from the embedded space can 
easily be nonlinearized through Gaussian processes. This model is referred to as 
a Gaussian process latent variable model. The model is related to kernel PCA 
and multidimensional scaling. 

Probabilistic PCA has been extended to its mixture for data sampled from 
multivariate t-distributions to combine both robustness and dimension reduction 
[215]. Mixtures of probabilistic PCA model high-dimensional nonlinear data by 
combining local linear models. Each mixture component is specifically designed 
to extract the local principal orientations in the data. The mixtures of robust 
probabilistic PCA modules are introduced by means of the Student-t distribution 
[7]. 

Exponential PCA [197] reduces the dimension of the parameters of probability 
distributions using Kullback information as a distance between two distributions. 
It also provides a framework for dealing with various data types such as binary 
and integer for which the Gaussian assumption on the data distribution is inap- 
propriate. A learning algorithm for those mixture models based on the variational 
Bayesian method is derived. 

In case of high-dimensional and very sparse data, overfitting becomes a severe 
problem and traditional algorithms for PCA are very slow. A fast algorithm is 
introduced and extended to variational Bayesian learning [91]. 

Bilinear probabilistic PCA is a probabilistic PCA model on 2-D data [216]. A 
probabilistic model for GLRAM is formulated as a two-dimensional PCA method 
[204]. 


Probabilistic clustering 


Model-based clustering approaches are usually based on mixture-likelihood and 
classification-likelihood. EM and classification EM [25] are the corresponding 
examples of these approaches. They are very sensitive to the initial conditions of 
the model’s parameters. A deterministic annealing EM algorithm [189] tackles 
the initialization issue via a deterministic annealing process. 

EM clustering [20] represents each cluster using a probability distribution,
typically a Gaussian distribution. Each cluster is represented by a mean and
a $J_1 \times J_1$ covariance matrix. Each pattern belongs to all the clusters with the
probabilities of membership determined by the distributions of the corresponding
clusters. Thus, EM clustering can be treated as a fuzzy clustering technique.
C-means is equivalent to classification EM corresponding to the uniform spherical
Gaussian model [25]. C-means is similar to EM in that it alternates between two
steps. In the E-step, it computes C distances per point $x_i$, finds the nearest
centroid $c_j$ to $x_i$, and updates $c_j$ (sufficient statistics). In the M-step, it
redistributes patterns among the clusters using an error criterion.
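
The contrast between the soft memberships of EM clustering and the hard assignments of C-means can be sketched as follows. This is only a minimal illustration, assuming one-dimensional data, equal mixing weights and a shared, fixed variance; the function names are hypothetical and are not taken from [20] or [25].

```python
import math

def soft_memberships(x, means, var=1.0):
    """EM-style E-step: membership probabilities of point x in each
    Gaussian cluster (equal priors and a shared variance are assumed)."""
    dens = [math.exp(-(x - m) ** 2 / (2.0 * var)) for m in means]
    total = sum(dens)
    return [d / total for d in dens]

def c_means_step(data, centroids):
    """One C-means iteration: assign every point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    groups = [[] for _ in centroids]
    for x in data:
        j = min(range(len(centroids)), key=lambda k: (x - centroids[k]) ** 2)
        groups[j].append(x)
    # An empty cluster keeps its old centroid.
    return [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]

data = [0.0, 0.2, 0.4, 4.0, 4.2, 4.4]
centroids = [0.0, 1.0]
for _ in range(10):
    centroids = c_means_step(data, centroids)

# Soft membership of the point 0.4 in the two clusters found above.
memberships = soft_memberships(0.4, centroids)
```

In the soft version every point contributes to every cluster in proportion to its membership, whereas C-means lets each point contribute to exactly one centroid.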

The learning process of a probabilistic SOM is considered as a model-based 
data clustering procedure that preserves the topological relationships between 
data clusters [33]. A coupling-likelihood mixture model extends the reference vec- 
tors in SOM to multivariate Gaussian distributions, and three EM-type learning 
algorithms are given in [33]. In [122], SOM is extended by performing a proba- 
bilistic PCA at each neuron. The approach has a low complexity on the input 
dimension. This allows the processing of very high-dimensional data to obtain 
reliable estimations of the probability densities which are based on the proba- 
bilistic PCA framework. In [123], the probabilistic SOM is derived from a proba- 
bilistic mixture of multivariate Student-t components to improve the robustness 
of the map against outliers. 

Variational Bayesian clustering algorithms utilize an evaluation function that 
can be described as the log-likelihood of given data minus the Kullback-Leibler 
divergence between the prior and the posterior of model parameters [178]. The 
update process of variational Bayesian clustering with a finite mixture of Student-t
distributions is derived, taking the penalty term for the degrees of freedom into
account.
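
In a commonly used notation (an assumption here, not necessarily that of [178]), with data $X$, model parameters $\theta$, prior $p(\theta)$ and approximate posterior $q(\theta)$, an evaluation function of this kind can be written as

$$\mathcal{F}(q) = E_{q(\theta)}\left[\ln p(X \mid \theta)\right] - \mathrm{KL}\left(q(\theta) \,\|\, p(\theta)\right) \le \ln p(X).$$

The first term rewards fit to the data, while the Kullback-Leibler term penalizes posteriors that move far from the prior; maximizing $\mathcal{F}$ over $q$ tightens the lower bound on the log-evidence.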

Maximum weighted likelihood [34] learns the model parameters via maximiz- 
ing a weighted likelihood. It provides a general learning paradigm for density- 
mixture model selection and learning, in which weight design is a key issue. A 
rival penalized EM algorithm for density mixture clustering makes the compo- 
nents in a density mixture compete with one another at each time step. Not 
only are the associated parameters of the winner updated to adapt to an input, 
but also all rivals’ parameters are penalized with a strength proportional to the 
corresponding posterior density probabilities. Rival penalized EM can automat- 
ically select an appropriate number of densities by fading out the redundant 
densities during the learning process. A simplified rival penalized EM is applica- 
ble to elliptical clusters as well with any input proportion. Compared to RPCL 
and its variants, this method avoids the preselection of the delearning rate. The 
proposed stochastic version of RPCL also circumvents the selection of delearning 
rate. A heuristic extended EM algorithm [214] performs model selection by fad- 
ing the redundant components out from a density mixture, meanwhile estimating 
the model parameters appropriately. 


Probabilistic ICA


Bayesian ICA algorithms [9], [133], [85], [27] offer accurate estimations for the
linear model parameters. In Bayesian ICA, the source distributions are modelled
explicitly; for instance, a mixture of Gaussians may be used as a universal
density approximator for the source distributions. Variational Bayesian
nonlinear ICA [108], [192] uses the MLP to model the nonlinear mixing
transformation.

ICA and PCA are closely related to factor analysis. Factor analysis uses the 
Gaussian model and is not suitable for BSS, and the ML estimate of the mixing 
matrix is not unique. Independent factor analysis [9] recovers independent hidden 
sources from their observed mixtures. It generalizes and unifies factor analysis, 
PCA and ICA. The source densities, mixing matrix and noise covariance are first 
estimated from the observed data by ML using an EM algorithm. Each source is 
described by a mixture of Gaussians; thus, all the probabilistic calculations can 
be performed analytically. The sources are then reconstructed from the observed 
data by an optimal nonlinear estimator. A variational approximation of this 
algorithm is derived for cases with a large number of sources. The complexity of 
independent factor analysis grows exponentially in the number of sources. 

ICA can be performed using a constrained version of EM [198]. The source dis- 
tributions are modeled as d one-dimensional mixtures of Gaussians. The observed 
data are modeled as linear mixtures of the sources with additive, isotropic noise. 
The EM algorithm allows factoring the posterior density. This avoids an expo- 
nential increase of complexity with the number of sources and allows an exact 
treatment in the case of many sources. 

Variational Bayesian applied to ICA can handle small data sets with high 
observed dimension. This method can perform model selection and avoid over- 
fitting. The Gaussian analyzers mixture model is extended to an ICA mixture 
model in [39]: Variational Bayesian inference and structure determination are 
employed to construct an approach for modeling nongaussian, discontinuous 
manifolds; it automatically determines the local dimensions of each manifold and 
uses variational inference to calculate the optimum number of ICA components. 
In [27], variational Bayesian ICA is extended to problems with high-dimensional 
data containing missing entries. The complexity of the method is $O(NK^L)$, for
N data points, L hidden sources assumed, and K one-dimensional Gaussians used
to model the density of each source.

In mean-field approaches to probabilistic ICA [85], the sources are estimated 
from the mean of their posterior distribution and the mixing matrix is estimated 
by MAP. The latter requires the computation of the correlations between sources. 
Three mean-field methods are the variational approach, linear response correc- 
tions, and the adaptive Thouless-Anderson-Palmer mean-field theory [144]. For 
the mean-field ICA methods [85], the flexibility with respect to the prior makes 
the inside of the black-box rather complicated; convergence of EM-based learning
is slow and no universal stopping criteria can be given. icaMF [200] solves
these problems by using efficient optimization schemes of EM. The expectation
consistent framework and expectation propagation message-passing algorithm 
are applied to the ICA model. The method is flexible with respect to choice 
of source prior, dimensionality and constraints of the mixing matrix (uncon- 
strained or nonnegativity), and structure of the noise covariance matrix. The 
required expectations over the source posterior are estimated with mean-field 
methods. 

In [217], a parametric density model, which is suitable for separating both
super-gaussian and sub-gaussian sources depending on the setting of its
parameters, is employed for BSS. The intractable posterior is transformed into
a gaussian form by representing the prior density as a variational optimization
problem, and a variational EM algorithm is then derived for BSS. Variational EM
can perform blind separation of more sources than mixtures.

The Bayesian nonstationary source separation algorithm of [87] recovers non- 
stationary sources from noisy mixtures. In order to exploit the temporal structure 
of the data, a time-varying autoregressive process is used to model each source 
signal. Variational Bayesian learning is then adopted to integrate the source 
model with BSS in probabilistic form. The separation algorithm makes full use 
of temporally correlated prior information and avoids overfitting in the separa- 
tion process. Variational EM steps are applied to derive approximate posteriors 
and a set of update rules for the parameters of these posteriors. 

The convergence speed of FastICA is analyzed in [191]. The analysis suggests
that the nonlinearity used in FastICA can be interpreted as denoising, and that
taking Bayesian noise filtering as the nonlinearity results in fast Bayesian ICA. This
denoising interpretation is generalized in [163] and a source separation frame- 
work called denoising source separation is introduced. The algorithms differ in 
the denoising function while the other parts remain mostly the same. Some 
existing ICA algorithms are reinterpreted within the denoising source separation 
framework and some robust BSS algorithms are suggested. 

In real-world applications, the mixing system and source signals in ICA may 
be nonstationary. In [38], a separation procedure is established in the presence of 
nonstationary and temporally correlated mixing coefficients and source signals. 
In this procedure, the evolved statistics are captured from sequential signals 
according to online Bayesian learning for the Gaussian process. A variational 
Bayesian inference is developed to approximate the true posterior for estimating 
the nonstationary ICA parameters and for characterizing the activity of latent 
sources. 


Bayesian approach to neural network learning


The Bayes learning rule is the optimal perceptron learning rule that gives rise
to a lower bound on the generalization error [56]. An important approximation
to the Bayesian method is proposed and analyzed in [142] for online learning
on perceptrons; it relies on projecting the posterior probabilities of the
parameters to be estimated onto a space of tractable distributions by minimizing
the Kullback-Leibler divergence between the two. A variational algorithm for
single-layer perceptron learning by a Hebbian rule is given in [100].

In a Bayesian approach to RBF networks [86], the model complexity can be 
adjusted according to the complexity of the problem using an MCMC sampler. 
A hierarchical full Bayesian model for RBF networks [5] treats the model size, 
model parameters, regularization parameters, and noise parameters as unknown 
random variables. Simulated annealing with the reversible-jump MCMC method
is used to perform the Bayesian computation. A principled Bayesian learning 
approach to neural networks can lead to many improvements [128]. In particular, 
by approximating the distributions of the weights with Gaussians and adopting 
smoothing priors, it is possible to obtain estimates of the weights and output 
variances as well as to set the regularization coefficients automatically [128]. 

The EM algorithm for normalized Gaussian RBF network is derived in [205], 
from which an online EM algorithm [164] is derived by introducing a discount 
factor. Online EM is equivalent to batch EM if a specific scheduling of the dis- 
count factor is employed. 

A Bayesian interpretation of LDA is considered in [26]. With the use of a 
Gaussian process prior, the model is shown to be equivalent to a regularized 
kernel Fisher’s discriminant. In [117], a Bayesian model family is developed as 
a unified probabilistic formulation of the latent variable models such as factor 
analysis and PCA. It employs exponential family distributions to specify various 
types of factors and a Gibbs sampling procedure as a general computation rou- 
tine. An EM approach to kernel PCA [158] is a computationally efficient method, 
especially for a large number of data points. 

Under the framework of MAP probability, two-dimensional NMF [64] is adap- 
tively tuned using the variational approach. The method enables a generalized 
criterion for variable sparseness to be imposed onto the solution, and prior infor- 
mation to be explicitly incorporated into the basis features. 

A Bayesian approach toward echo state networks, called the echo state Gaus- 
sian process, combines the merits of echo state networks and Gaussian processes 
to provide a more robust solution to reservoir computing while offering a measure 
of confidence on the generated predictions [29]. In [168], a variational Bayesian 
framework combined with automatic regularization and delay-and-sum readout 
adaptation is proposed for efficient training of echo state networks. 

The kernel-based Bayesian network paradigm is introduced for supervised clas- 
sification [150]. This paradigm is a Bayesian network which estimates the true 
density of the continuous variables using kernels. It uses a non-parametric kernel- 
based density estimation instead of a parametric Gaussian one. 

Using EM and variational approximation methods, the variational Bayesian LS 
approach [185] offers a computationally efficient and statistically robust black- 
box approach to generalized linear regression with high-dimensional inputs. The 
framework of sparse Bayesian learning, the relevance vector machine, is derived 
with variational Bayesian LS at its core. The iterative nature of variational
Bayesian LS makes it most suitable for real-time incremental learning. 

Quasi-Newton algorithms can be interpreted as approximations to Bayesian 
linear regression under Gaussian and other priors [79]. This analysis gives rise 
to a class of Bayesian nonparametric quasi-Newton algorithms, which have a 
computational cost similar to that of their predecessors [79]. These use a kernel model to
learn from all observations in each line-search, explicitly track uncertainty, and 
thus achieve faster convergence towards the true Hessian. 

Affinity propagation [60] is an unsupervised learning algorithm for exemplar- 
based clustering. It associates each data point with one exemplar, resulting in 
a partitioning of the whole data set into clusters by maximizing the overall sum
of similarities between data points and their exemplars. Real-valued messages 
are exchanged between data points until a high-quality set of exemplars and 
corresponding clusters gradually emerges. Affinity propagation has been derived 
as an instance of the max-product (belief-propagation) algorithm in a loopy 
factor graph [103], [149]. 

Sparse coding is modeled in a Bayesian network framework in [166], where a
sparsity-favouring Laplace prior is applied on the coefficients of the linear model
and expectation propagation inference is employed. In [124], sparse coding is
interpreted from a Bayesian perspective, which results in an objective function.
Through MAP estimation, the obtained solution can have smaller reconstruction
errors than those obtained by the standard method using $L_1$ regularization.


Boltzmann machines 


The Boltzmann machine as well as some other stochastic models can be treated 
as generalizations of the Hopfield model. The Boltzmann machine integrates 
the global optimization capability of simulated annealing. By a combination of 
the concept of energy and the neural network topology, these models provide a 
method to deal with notorious COPs. 

The Boltzmann machine is a stochastic recurrent network based on physical 
systems [2, 81]. It has the same network architecture as that of the Hopfield 
model, that is, it is highly recurrent with $w_{ij} = w_{ji}$ and $w_{ii} = 0$, $i, j = 1, \ldots, J$.
Unlike the Hopfield network, the Boltzmann machine can have hidden units. 
The Hopfield network operates in an unsupervised manner, while the Boltzmann 
machine can be trained in an unsupervised or supervised manner. 

Neurons of a Boltzmann machine are divided into visible and hidden units, as 
illustrated in Fig. 19.7. In Fig. 19.7a, the visible units are clamped onto specific 
states determined by the environment, while the hidden units always operate 
freely. By capturing high-order statistical correlations in the clamping vector, the 
hidden units simulate the underlying constraints contained in the input vectors. 
This type of Boltzmann machine uses unsupervised learning, and can perform 
pattern completion. When the visible units are further divided into input and
output neurons, as shown in Fig. 19.7b, this type of Boltzmann machine can be
trained in a supervised manner. The recurrence eliminates the difference in input
and output cells. The Boltzmann machine, operated in sequential or synchronous
mode, is a universal approximator for arbitrary functions defined on finite sets
[211].

Figure 19.7 Architecture of Boltzmann machine. (a) An architecture with visible and hidden neurons.
(b) An architecture with input, output, and hidden neurons.

Instead of using a sigmoidal function as in the Hopfield network, the activation
at each neuron takes either 0 or 1, with a probability that depends on a
temperature variable T

$$net_i = \sum_{j=1, j \neq i}^{J} w_{ji} x_j = \boldsymbol{w}_i^T \boldsymbol{x}, \qquad (19.26)$$

$$x_i = \begin{cases} 1, & \text{with probability } P_i \\ 0, & \text{with probability } 1 - P_i \end{cases}, \qquad (19.27)$$

$$P_i = \frac{1}{1 + e^{-net_i / T}}. \qquad (19.28)$$

When T is very large or $net_i$ approaches 0, $x_i$ takes either 1 or 0 with equal
probability. For very small T, $x_i$ becomes deterministic: it is 1 when
$net_i > 0$ and 0 when $net_i < 0$. The input and output
states can be fixed or variable.
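
The stochastic update of a single unit can be sketched in a few lines. This is an illustrative sketch only, assuming the logistic firing probability with temperature T described above; the function names and the small weight matrix are made up for the example.

```python
import math
import random

def firing_probability(net, temperature):
    """Logistic probability that a unit takes the value 1.
    (For extreme net / temperature a numerically stable form would be needed.)"""
    return 1.0 / (1.0 + math.exp(-net / temperature))

def net_input(weights, states, i):
    """Weighted sum of the other units' states feeding unit i."""
    return sum(weights[j][i] * states[j] for j in range(len(states)) if j != i)

def sample_unit(weights, states, i, temperature, rng=random):
    """Draw the new binary state (0 or 1) of unit i."""
    p = firing_probability(net_input(weights, states, i), temperature)
    return 1 if rng.random() < p else 0

# Symmetric weights with a zero diagonal (hypothetical values).
weights = [[0.0, 2.0, -1.0],
           [2.0, 0.0, 0.5],
           [-1.0, 0.5, 0.0]]
states = [1, 1, 0]

net0 = net_input(weights, states, 0)       # w_10 * x_1 + w_20 * x_2
p_hot = firing_probability(net0, 100.0)    # high temperature: close to 0.5
p_cold = firing_probability(net0, 0.05)    # low temperature: close to 1, since net0 > 0
```

At high temperature the unit behaves almost like a fair coin, while at low temperature the sign of the net input decides its state.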

Search for all the possible states is performed at temperature T in the Boltz- 
mann machine. At the steady state, the relative probability of two states in the 
Boltzmann machine is determined by the Boltzmann distribution of the energy 
difference between the two states 

$$\frac{P_\alpha}{P_\beta} = e^{-(E_\alpha - E_\beta)/T}, \qquad (19.29)$$

where $E_\alpha$ and $E_\beta$ are the corresponding energy levels of the two states. The
energy can be computed by the same formula as for the Hopfield model

$$E(t) = -\frac{1}{2} \boldsymbol{x}^T(t) \boldsymbol{W} \boldsymbol{x}(t) = -\sum_{i<j} x_i(t) x_j(t) w_{ij}. \qquad (19.30)$$

In [10], a synchronous Boltzmann machine as well as its learning algorithm has
been introduced to facilitate parallel implementations.
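
The energy function and the relative probability of two states can be illustrated numerically. The sketch below assumes a symmetric weight matrix with zero diagonal and binary states; the helper names and the example weights are hypothetical.

```python
import math

def energy(weights, states):
    """E = -sum_{i<j} w_ij x_i x_j for a symmetric, zero-diagonal weight matrix."""
    n = len(states)
    return -sum(weights[i][j] * states[i] * states[j]
                for i in range(n) for j in range(i + 1, n))

def relative_probability(e_a, e_b, temperature):
    """Ratio P_a / P_b of two states under the Boltzmann distribution."""
    return math.exp(-(e_a - e_b) / temperature)

weights = [[0.0, 1.0, -2.0],
           [1.0, 0.0, 0.5],
           [-2.0, 0.5, 0.0]]
state_a = [1, 1, 0]   # only the pair (0, 1) is active, with weight +1.0
state_b = [1, 0, 1]   # only the pair (0, 2) is active, with weight -2.0

e_a = energy(weights, state_a)
e_b = energy(weights, state_b)
ratio = relative_probability(e_a, e_b, temperature=1.0)
```

Here state a has the lower energy, so it is the more probable of the two at steady state; raising the temperature moves the ratio toward 1.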

Like the complex-valued multistate Hopfield model, a multivalued Boltzmann 
machine proposed in [120] extends the binary Boltzmann machine. Each neuron 
of the multivalued Boltzmann machine can only take L discrete stable states, and 
the angle between two adjacent directions is given by $\varphi_0 = 2\pi/L$. The probability of
state change is determined by the Boltzmann distribution of the energy difference 
between the two states. 


Boltzmann learning algorithm 


When using the Boltzmann machine with hidden units, the generalized Hebbian 
rule cannot be used in an unsupervised manner. For supervised learning of the 
Boltzmann machine, BP is not applicable due to the different network archi- 
tecture. Simulated annealing is used by Boltzmann machines to learn weights 
corresponding to the global optimum. The learning process in the Boltzmann 
machine is computationally very expensive. The learning complexity of the exact 
Boltzmann machine is exponential in the number of neurons [98]. 

For constraint-satisfaction problems, some of the neurons are externally 
clamped to some input patterns, and we then find the global minimum for 
these particular input patterns. The integration of simulated annealing into the 
Boltzmann learning rule makes the Boltzmann machine especially suitable for 
constraint-satisfaction tasks involving a large number of weak constraints [81]. 

Boltzmann machines can be regarded as Markov random fields with binary 
random variables. For binary cases, they are equivalent to the Ising-spin model 
in statistical mechanics. Learning in Boltzmann machines is an NP-hard problem. 
The original Boltzmann learning algorithm [81, 2] is based on counting occur- 
rences. The Boltzmann learning algorithm based on correlations [151] provides a 
better performance than the original algorithm. The correlation-based learning 
procedure for the Boltzmann machine is given in Algorithm 19.2 [151, 76]. 

Boltzmann learning is implemented in four steps: initialization, clamping
phase, free-running phase and weight update. The algorithm iterates over the
second to fourth steps for each epoch. Equation (19.34) is called the Boltzmann
learning rule.
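
The four steps can be sketched for a very small network. This is a simplified illustration rather than the procedure of [151]: it assumes units with states +1 or -1, a short made-up annealing schedule, a fixed number of sampling sweeps at the final temperature, and a learning rate equal to a small constant divided by the final temperature. All function and parameter names are hypothetical.

```python
import math
import random

def gibbs_sweep(w, x, free_units, temperature, rng):
    """Update each unclamped unit once; states are +1 or -1."""
    for i in free_units:
        net = sum(w[j][i] * x[j] for j in range(len(x)) if j != i)
        p = 1.0 / (1.0 + math.exp(-net / temperature))
        x[i] = 1 if rng.random() < p else -1

def correlations(w, patterns, n_units, free_units, schedule, n_samples, rng):
    """Anneal along `schedule`, then average x_i * x_j at the final temperature."""
    rho = [[0.0] * n_units for _ in range(n_units)]
    for pattern in patterns:
        x = list(pattern) + [1] * (n_units - len(pattern))
        for i in free_units:                  # free units start from random states
            x[i] = rng.choice((-1, 1))
        for temperature in schedule:
            gibbs_sweep(w, x, free_units, temperature, rng)
        for _ in range(n_samples):
            gibbs_sweep(w, x, free_units, schedule[-1], rng)
            for i in range(n_units):
                for j in range(n_units):
                    if i != j:
                        rho[i][j] += x[i] * x[j] / (n_samples * len(patterns))
    return rho

def boltzmann_epoch(w, patterns, n_visible, n_hidden, schedule, eps, n_samples, rng):
    """One epoch: clamped phase, free-running phase, then the weight update."""
    n = n_visible + n_hidden
    hidden = list(range(n_visible, n))
    rho_plus = correlations(w, patterns, n, hidden, schedule, n_samples, rng)
    rho_minus = correlations(w, patterns, n, list(range(n)), schedule, n_samples, rng)
    eta = eps / schedule[-1]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] += eta * (rho_plus[i][j] - rho_minus[i][j])

# Initialization: symmetric random weights with a zero diagonal.
rng = random.Random(0)
n_visible, n_hidden = 3, 1
n = n_visible + n_hidden
w = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = rng.uniform(-0.5, 0.5)

patterns = [(1, 1, -1), (-1, -1, 1)]   # unsupervised case: all visible units clamped
schedule = [4.0, 2.0, 1.0]             # from the initial to the final temperature
for _ in range(20):
    boltzmann_epoch(w, patterns, n_visible, n_hidden, schedule, 0.1, 20, rng)
```

Because the clamped and free-running correlations are both symmetric in i and j, the update keeps the weight matrix symmetric. The nested sampling loops also make the cost of even this toy example visible, which is the reason faster approximations are of interest.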

The Boltzmann machine is suitable for modeling biological phenomena, since
biological neurons are stochastic systems. The learning process is, however,
very slow, although it can find the global optimum.


Mean-field-theory machine 


Mean-field approximation is a well-known method in statistical physics [70], 
[184]. The mean-field annealing algorithm was proposed to accelerate the
convergence of the Boltzmann machine [151]. The Boltzmann machine with such an
algorithm is also termed the mean-field-theory machine or deterministic
Boltzmann machine.

Algorithm 19.2 (Boltzmann Learning).
1. Initialization.
   a. Initialize $w_{ji}$: Set $w_{ji}$ as uniform random values in $[-a_0, a_0]$, typically
      $a_0 = 0.5$ or 1.
   b. Set the initial and the final temperatures: $T_0$ and $T_f$.
2. Clamping phase.
   Present the patterns. For unsupervised learning, all the visible nodes are
   clamped to the patterns. For supervised learning, all the input and output
   nodes are clamped to the pattern pairs.
   a. For each example, perform simulated annealing until $T_f$ is reached.
      i. At each T, relax the network by the Boltzmann distribution for a length
         of time through updating the states of the unclamped (hidden) units

         $$x_i = \begin{cases} +1, & \text{with probability } P_i \\ -1, & \text{with probability } 1 - P_i \end{cases}, \qquad (19.31)$$

         where $P_i$ is calculated by (19.26) and (19.28).
      ii. Update T by the annealing schedule.
   b. At $T_f$, estimate the correlation in the clamped condition

      $$\rho_{ij}^{+} = E[x_i x_j], \quad i, j = 1, 2, \ldots, J; \; i \neq j. \qquad (19.32)$$

3. Free-running phase.
   a. Repeat Step 2a. For unsupervised learning, all the visible neurons are now
      free-running. For supervised learning, only the input neurons are clamped
      and the output neurons are free-running.
   b. At $T_f$, estimate the correlation in the free-running condition

      $$\rho_{ij}^{-} = E[x_i x_j], \quad i, j = 1, 2, \ldots, J; \; i \neq j. \qquad (19.33)$$

4. Weight update.
   The weight update is performed as

      $$\Delta w_{ij} = \eta \left( \rho_{ij}^{+} - \rho_{ij}^{-} \right), \qquad (19.34)$$

   where $\eta = \varepsilon / T$, with $\varepsilon$ being a small positive constant.
5. Repeat Steps 2 through 4 for next epoch until there is no change in $w_{ij}$, $\forall i, j$.

Mean-field annealing, which replaces all the states in the Boltzmann machine
by their averages, can be treated as a deterministic form of Boltzmann learning.
In [151], the correlations in the Boltzmann learning rule are replaced by the naive
mean-field approximation. Instead of the stochastic binary neuron output for
the Boltzmann machine, continuous neuron outputs, which are calculated as the
average of the probability of the binary neuron variables at temperature T, are
used. The average of state $x_i$ is calculated for a specific value of activation $net_i$
according to (19.31), (19.26) and (19.28)

$$E[x_i] = (+1) P_i + (-1)(1 - P_i) = 2 P_i - 1 = \tanh\left(\frac{net_i}{2T}\right). \qquad (19.35)$$

The correlation in the Boltzmann learning rule is replaced by the mean-field
approximation [151]

$$E[x_i x_j] \simeq E[x_i] E[x_j]. \qquad (19.36)$$


The above approximation method is usually termed the naive or zero-order 
mean-field approximation. The mean-field-theory machine is one to two orders 
of magnitude faster than the Boltzmann machine [74, 151]. 
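
The deterministic update that replaces sampling can be sketched as a fixed-point iteration. The example assumes units with states +1 or -1 and the logistic firing probability used earlier, for which the mean state equals tanh(net / (2T)); the function names, the starting value 0.1 and the small weight matrix are hypothetical choices made for illustration.

```python
import math

def mean_field_states(w, temperature, n_iter=200, clamped=None):
    """Iterate m_i = tanh(net_i / (2T)) with net_i = sum_j w_ji m_j."""
    n = len(w)
    clamped = clamped or {}
    m = [clamped.get(i, 0.1) for i in range(n)]   # small nonzero start for free units
    for _ in range(n_iter):
        for i in range(n):
            if i in clamped:
                continue
            net = sum(w[j][i] * m[j] for j in range(n) if j != i)
            m[i] = math.tanh(net / (2.0 * temperature))
    return m

def mean_field_correlation(m, i, j):
    """Naive mean-field estimate of E[x_i x_j] as a product of means."""
    return m[i] * m[j]

w = [[0.0, 1.5, 0.0],
     [1.5, 0.0, -1.0],
     [0.0, -1.0, 0.0]]
m = mean_field_states(w, temperature=0.5, clamped={0: 1.0})
rho_01 = mean_field_correlation(m, 0, 1)
```

A single deterministic relaxation of this kind stands in for the many stochastic sweeps needed to estimate the correlations by sampling, at the price of ignoring dependencies between units.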

However, the validity of the naive mean-field algorithm is challenged in [62, 98]. 
By applying naive mean-field approximation to a finite system with non-random 
interactions, the true stochastic system is not faithfully represented in many situ- 
ations. The independence assumption is shown to be unacceptably inaccurate in 
multiple-hidden-layer configurations. As a result, the mean-field-theory machine 
only works in supervised mode with a single hidden layer [62, 76]. The mean state 
is not a sufficient representation for the free-running probability distribution and 
thus, the mean-field method is ineffective for unsupervised learning. In [98], the 
naive mean-field approximation of the learning rules is shown to not converge in 
general; it leads to a converging gradient-descent algorithm only when (19.36) is 
satisfied for $i \neq j$.

In [104], the equivalence between the asynchronous mean-field-theory machine 
and the continuous Hopfield model is established in terms of the same fixed points 
for networks using the same Hopfield topology and energy function. The naive 
mean-field-theory machine performs the steepest descent on an appropriately 
defined cost function under certain circumstances, and has been empirically used 
to solve a variety of supervised learning problems [82]. 

An approximate mean-field algorithm for the Boltzmann machine [98] has a
computational complexity that is cubic in the number of neurons. In the absence of
hidden units, the weights can be directly computed from the fixed-point equation
of the learning rules, and thus a gradient-descent procedure is avoided. The 
solutions are close to the optimal ones; thus the method yields a significant 
improvement when correlations play a significant role. 

The mean-field annealing algorithm can be derived by optimizing the 
Kullback-Leibler divergence between the factorial approximating distribution 
and the ideal joint distribution of the binary neural variables in terms of the 
mean activations. In [201], two interactive mean-field algorithms are derived by 
extending the internal representations to include both the mean activations and 
the mean correlations. The two algorithms, respectively, estimate the mean acti- 
vations subject to the mean correlations, and the mean correlations subject to 
the mean activations by optimizing the objective quantified by a combination of
the Kullback-Leibler divergence and the correlation strength between any two 
distinct variables. The interactive mean-field algorithms improve the mean-field 
approximation in both performance and relaxation efficiency. 

In variational approaches, the posterior distributions are either approximated 
by factorized gaussians, or integrals over the posteriors are evaluated by saddle- 
point approximations [8]. The resulting algorithm is an EM-like procedure with 
the four estimations performed sequentially. Practical learning algorithms for 
Boltzmann machines are proposed in [209] by using the belief propagation 
algorithm and the linear response approximation, which are often referred to as
advanced mean-field methods. 


Stochastic Hopfield networks 


Like the Hopfield network, both the Boltzmann machine and the mean-field- 
theory machine can be used as associative memory. The Boltzmann and mean- 
field-theory machines that use hidden units have a far higher capacity for storage 
and error-correcting retrieval of random patterns and improved basins of attrac- 
tion compared to the Hopfield network [74]. 

When the Boltzmann machine is trained as associative memory using an adap- 
tive association rule [97], it does not suffer from spurious states. The association 
rule, which creates a sphere of influence around each stored pattern, is a gener- 
alization of the generalized Hebbian rule. Spurious fixed points, whose regions of 
attraction are not recognized by the rule, are skipped, due to the finite probability 
to escape from any state. The upper and lower bounds on retrieval probabilities 
of each stored pattern are also given in [97]. 

Due to the existence of the hidden units, neither the Boltzmann machine nor 
the mean-field-theory machine can be trained and retrieved in the same way as in 
the case of the Hopfield model. The retrieval process is as follows [74]. The visible 
neurons are clamped to a corrupted pattern, the whole network is annealed to 
a lower temperature, where the state of the hidden neurons approximates the 
learned internal representation of the stored pattern, and then the visible neurons 
are released. The annealing process continues until the whole network is settled. 

The Gaussian machine [4] is a general framework that includes the Hopfield
network, the Boltzmann machine and also other stochastic networks. Stochastic
distribution is realized by adding thermal noise, a stochastic external input $\varepsilon$,
to each unit, and the network dynamics are the same as that of the Hopfield
network. The stochastic term $\varepsilon$ obeys a Gaussian distribution with zero mean
and variance $\sigma^2$, where the deviation $\sigma = kT$, and T is the temperature. The
stochastic term $\varepsilon$ can occasionally bring the network to states with a higher
energy. When $k = \sqrt{8/\pi}$, the distribution of the outputs has the same behavior
as a Boltzmann machine. When employing noise obeying a logistic distribution
rather than a Gaussian distribution in the original definition, we can obtain a
Gaussian machine identical to a Boltzmann machine. When the noise in the
Gaussian machine takes a Cauchy distribution with zero as the peak location
and the half-width at the maximum $\sigma = T\sqrt{8/\pi}$, we get a Cauchy machine [177].
A similar idea was embodied in the stochastic network given in [116], where in
addition to a cooling schedule for temperature T, gain annealing is also applied.
The gain $\beta$ has to be decreased more slowly than T, and kept bounded away
from zero.
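
How closely Gaussian noise can mimic the logistic firing rule of a Boltzmann machine can be checked numerically. The sketch assumes a threshold unit that fires when its net input plus noise is positive, and takes k = sqrt(8/pi), the value for which the Gaussian and logistic firing curves have the same slope at zero net input; the function names are hypothetical.

```python
import math

K = math.sqrt(8.0 / math.pi)   # assumed scale factor relating sigma to T

def logistic_probability(net, temperature):
    """Boltzmann-machine firing probability."""
    return 1.0 / (1.0 + math.exp(-net / temperature))

def gaussian_noise_probability(net, temperature, k=K):
    """P(net + eps > 0) for eps ~ N(0, (k T)^2), via the normal CDF."""
    sigma = k * temperature
    return 0.5 * (1.0 + math.erf(net / (sigma * math.sqrt(2.0))))

temperature = 1.0
grid = [x / 10.0 for x in range(-80, 81)]
max_gap = max(abs(logistic_probability(x, temperature)
                  - gaussian_noise_probability(x, temperature)) for x in grid)
```

The two curves agree exactly at zero net input and differ by only a few percent elsewhere, which is the sense in which the Gaussian machine approximates a Boltzmann machine; with logistic noise the match is exact.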


Training deep networks 


An association between the structural complexity of Bayesian networks and their 
representational power is established in [121]. The maximum number of nodes’ 
parents is used as the measure for the structural complexity of Bayesian networks, 
and the maximum number of XORs contained in a target function as the measure 
for the function complexity. Discrete Bayesian networks with each node having 
at most k parents cannot represent any function containing (k + 1)-XORs [121]. 

Deep Bayesian networks are generative neural network models with many layers
of hidden causal factors, introduced along with an algorithm that performs
greedy layerwise unsupervised pre-training followed by supervised fine-tuning
[84]. The building block of a deep Bayesian network is a probabilistic model
called a restricted Boltzmann machine, used to represent one layer of the model.
Upper layers of a deep belief network are supposed to represent more abstract
concepts that explain the input observation $\boldsymbol{x}$, whereas lower layers extract
low-level features from $\boldsymbol{x}$. Using complementary priors, a fast, greedy algorithm
that can learn deep, directed belief networks one layer at a time is derived,
provided the top two layers form an undirected associative memory. Inference is
easy in restricted Boltzmann machines.

A restricted Boltzmann machine is a particular form of the product-of-experts 
model [83], which is also a Boltzmann machine with a bipartite connectivity 
graph. It is composed of two layers of binary stochastic nodes. The nodes in a 
visible layer are fully connected to the nodes in a hidden layer, and there is no 
intra-layer connection between nodes within the same layer. This connectivity 
ensures that all hidden units are statistically decoupled when visible units are 
clamped to the observed values. The restricted Boltzmann machine maximizes 
the log-likelihood for the joint distribution of all visible units, that is, the features 
and targets. It is trained in an unsupervised manner to model the distribution 
of the inputs. 
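
The decoupling of the hidden units can be made concrete. For binary units, each hidden unit is switched on with a logistic probability that depends only on the clamped visible vector, so a whole layer can be sampled in one pass. The following is a minimal sketch of one such alternating sampling step; the names `W`, `b`, `c` and the function names are illustrative assumptions, not notation from the text.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j; W[j][i] links hidden j to visible i."""
    return [sigmoid(c[j] + sum(W[j][i] * v[i] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i (same bipartite weights)."""
    return [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample_bernoulli(probs, rng):
    """Draw each binary unit independently, which the bipartite graph permits."""
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(v, W, b, c, rng):
    """One alternating step: sample all hidden units given v, then all visible units given h."""
    h = sample_bernoulli(hidden_probs(v, W, c), rng)
    v_new = sample_bernoulli(visible_probs(h, W, b), rng)
    return h, v_new
```

Because there are no intra-layer connections, neither sampling pass needs to iterate to equilibrium within a layer.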

Restricted Boltzmann machines are universal approximators of discrete distri- 
butions, as given by Theorem 19.5 [114]. 


Theorem 19.5 (Universal approximation of restricted Boltzmann
machine, Le Roux and Bengio (2008)). Any distribution over $\{0,1\}^n$ can
be approximated arbitrarily well (in the sense of the Kullback-Leibler divergence)
with a restricted Boltzmann machine with k+1 hidden units, where k is the
number of input vectors whose probability is not 0.


The existence proof of a universal deep belief network approximator is due
to Sutskever and Hinton [176]. Deep but narrow generative networks do not 
require more parameters than shallow ones to achieve universal approximation 
[115]. Deep but narrow feedforward neural networks with sigmoidal units can 
represent any Boolean expression [115]. 

Deep neural networks are trained in three steps [109]: pre-training one layer at 
a time in a greedy way; using unsupervised learning at each layer in a way that 
preserves information from the input and disentangles factors of variation; fine- 
tuning the whole network with respect to the ultimate criterion of interest. The 
choice of input distribution in restricted Boltzmann machines could be important 
for continuous-valued input and yields different types of filters at the first layer. 
The greedy layerwise unsupervised training strategy helps the optimization by 
initializing weights in a region near a good local minimum, but also implicitly 
acts as a sort of regularization that brings better generalization and encourages 
internal distributed representations that are high-level abstractions of the input 
[109]. 

Unsupervised pre-training is experimentally supported as an unusual form of 
regularization [58]: minimizing variance and introducing bias towards configu- 
rations of the parameter space that are useful for unsupervised learning. This 
type of regularization strategy is similar to the early stopping idea for training 
neural networks. When training using gradient descent, the beneficial general- 
ization effects due to pre-training do not appear to diminish as the number of 
labeled examples grows very large. Unsupervised pre-training sets the parameters
in a region from which better basins of attraction can be reached, in terms of
generalization.


19.1 Bayes’ decision theory makes a decision between two classes. A sample x belongs
to class C1 if P(C1|x) > P(C2|x), or it belongs to class C2 if P(C2|x) > P(C1|x).
Show that the rule can be represented by: Decide that x belongs to C1 if
P(x|C1)P(C1) > P(x|C2)P(C2), and to C2 otherwise.


19.2 Monty Hall Problem. In Monty Hall’s television game show, there are 
three doors: behind one door is a car and behind the other two are two goats. 
You have chosen one door, say door A. The door remains closed for the time 
being. Monty Hall now has to open one of the two remaining doors (B or C). He 
selects door C, and there is a goat behind it. Then he asks you: “Do you want 
to switch to Door B?” What is your decision? Solve it using Bayes’ theorem. 
19.3 Let X1 and X2 be the results of flips of two coins, and X3 an indicator of whether
X1 and X2 coincide. We have P(X1 = head) = P(X1 = tail) = P(X2 = head) =
P(X2 = tail) = 0.5.

1) Without evidence on X3, are nodes X1 and X2 marginally independent?
2) Show that X1 and X2 are dependent if the value of X3 is known: P(X1 =
head, X2 = head|X3 = 1) ≠ P(X1 = tail|X3 = 1)P(X2 = tail|X3 = 1).








19.4 Representation of the XOR problem requires the use of hidden nodes in the
MLP and the RBF network. This is also true for the Boltzmann machine. Solve 
the problem using the Boltzmann machine. 


19.5 Derive the likelihood ratio $p(x|C_1)/p(x|C_2)$ in the case of Gaussian densities


$$p(x|C_1) \sim N(\mu_1, \sigma_1^2) \quad \text{and} \quad p(x|C_2) \sim N(\mu_2, \sigma_2^2).$$





19.6 The likelihood ratio for a two-class problem is $p(x|C_1)/p(x|C_2)$. Show that the
discriminant function can be defined by $g(x) = p(x|C_1)P(C_1) - p(x|C_2)P(C_2)$:
If g(x) > 0, we choose class C1, otherwise class C2. Derive the discriminant
function in terms of the likelihood ratio.


19.7 For the Poisson distribution $p(x|\theta) = \theta^x \exp(-\theta)$, find $\theta$ using the
maximum-likelihood method. [Hint: find $\theta$ that maximizes $p(x|\theta)$.]


19.8 A discrete random variable X has the binomial distribution with parameters
n and p if its probability mass function is given by


$$p(x|n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, \ldots, n\}, \ p \in (0, 1).$$


Four products are made. Assume that the probability of any product being 
defective is p = 0.3 and each product is independently produced. 

(a) Calculate the probability of x ∈ {0, 1, 2, 3, 4} products being defective.

(b) Calculate the probability of fewer than x products being defective.


19.9 I was at work. I felt that the office building had suddenly shaken. I thought
that a heavy-duty truck had hit the building by accident. On the web there
were some messages that there had been an earthquake a moment before. Sometimes
a blast nearby also causes a building to shake. Was there an earthquake?
Construct a Bayesian network, and tentatively give CPTs.


19.10 Download the Bayes Nets Toolbox for MATLAB (bnt.googlecode.com),
and learn to use this software for structure learning of Bayesian networks.


19.11 A Bayesian network is shown in Fig. 19.8. Three CPTs are known. 
The first CPT is given by: P(C = T) = 0.5, P(C = F) = 0.5. The second CPT 
is given by: P(S = T|C = T) =0.1, P(S = F|C = T) = 0.9, P(S =T|C =F) = 
0.5, P(S = F|C = F) = 0.5. The third CPT is given by: P(R = T|C = T) = 
0.8, P(R = F|C = T) = 0.2, P(R = T|C = F) = 0.3, P(R = F|C = F) = 0.7. 
Calculate the CPT for P(W|S, R). 
Figure 19.8 The sprinkler, rain and wet pavement: C = cloudy, S = sprinkler, R = rain, W = wet grass.


19.12 Develop a computer program for random simulations of state transitions 
in a Markov chain with discrete time. 


19.13 Given the joint probability $P_{A,B}(a_i, b_j)$: $P_{A,B}(a_1, b_1) = 0.03$,
$P_{A,B}(a_1, b_2) = 0.05$, $P_{A,B}(a_1, b_3) = 0.20$, $P_{A,B}(a_2, b_1) = 0.15$, $P_{A,B}(a_2, b_2) = 0.02$,
$P_{A,B}(a_2, b_3) = 0.25$, $P_{A,B}(a_3, b_1) = 0.08$, $P_{A,B}(a_3, b_2) = 0.25$,
$P_{A,B}(a_3, b_3) = 0.30$. Compute $P_A$, $P_B$, $P_{A|B}$, $P_{B|A}$.


19.14 Use the HUGIN software to make queries on the Bayesian network
given by Fig. 19.8. Assume that we have CPT 1: P(S = 0|C = 0) = 0.5, P(S =
1|C = 0) = 0.5, P(S = 0|C = 1) = 0.8, P(S = 1|C = 1) = 0.2;

CPT 2: P(R = 0|C = 0) = 0.9, P(R = 1|C = 0) = 0.1, P(R = 0|C = 1) =
0.1, P(R = 1|C = 1) = 0.9;

CPT 3: P(W = 1|R = 0, S = 0) = 0.05, P(W = 1|R = 0, S = 1) = 0.8, P(W =
1|R = 1, S = 0) = 0.8, P(W = 1|R = 1, S = 1) = 0.95;

CPT 4: P(C = 0) = P(C = 1) = 0.5.

(a) Calculate P(C = 1|W = 1), P(C = 1|R = 1), P(C=1|S = 1). 

(b) Draw the DAG. Verify the result using HUGIN. 


























19.15 Consider the probability model $p(x_1, x_2, x_3) = p_1(x_1, x_2)\, p_2(x_2, x_3)$.
Plot the factor graph of the model. 


19.16 Discuss the similarities and differences between the Metropolis algorithm 
and the Gibbs sampler. 


19.17 Given the expression for the area of a circle, $A = \pi r^2$, and using uniformly
distributed random variates, devise a sampling approach for computing $\pi$.
20.1


Combining multiple learners: data 
fusion and ensemble learning


Introduction 


Different learning algorithms have different accuracies. The no free lunch theorem
asserts that no single learning algorithm achieves the best performance
in every domain. Multiple learners can therefore be combined to attain higher accuracy. Data fusion
is the process of fusing multiple records representing the same real-world object 
into a single, consistent, and clean representation. Fusion of data for improving 
prediction accuracy and reliability is an important problem in machine learning. 

Fusion strategies can be implemented at different levels, namely, signal
enhancement and sensor level (data level), feature level, classifier level, deci- 
sion level, and semantic level. Evidence theory [62] falls within the theory of 
imprecise probabilities. 

For classification, different classifiers can be generated by different initializations
of a classifier, by training a classifier with different training data, or by training a
classifier using different feature sets. Ensemble techniques [5], [58] build a number
of different predictors (base learners), then combine them to form a
composite predictor that classifies the test set. Two classifiers are said to be diverse
if they make different incorrect predictions on new data points; this property is
known as diversity [18]. Classifier diversity plays a critical role in ensemble
learning. The ensemble of predictors is often called a committee machine (or
mixture of experts). Ensemble learning is capable of improving on the
classification accuracy of any single classifier, given the same amount of training
information.

For ensemble learning, each of the classifiers composing the ensemble can be 
constructed either independently (e.g. bagging [5]) or sequentially (e.g. boosting 
[58]). The majority voting scheme is the most popular classifier fusion method. 
In this method, the final class is determined by the maximum number of votes 
counted among all the classifiers fused. Averaging is a simple but effective method 
and is used in many classification problems; the final class is determined by the 
average of continuous outputs of all classifiers fused. 
Ensemble learning methods


Bagging [5] and boosting [58] are two popular committee machines. They run 
a learning algorithm on different distributions over the training data. Bagging
builds training sets, called bags, of the same size as the original data set by
applying random sampling with replacement. Unlike bagging, boosting draws
tuples randomly, according to a distribution, and tries to concentrate on harder
examples by adaptively changing the distribution of the training set on the basis
of the performance of the previous classifiers. Bagging and random forests are
ensemble methods for classification, where a committee of trees each cast a vote 
for the predicted class. In boosting, unlike random forests, the committee of weak 
learners evolves over time, and the members cast a weighted vote. 

Subspace or multiview learning creates multiple classifiers from different fea- 
ture spaces. In this way, different classifiers build their decision boundaries in 
different views of the feature space. Multiview learning utilizes the agreement 
among learners to improve the overall classification performance. Representa- 
tive works in this area are the random subspace method [34], the random forest 
method [33], and the rotation forest [57]. 
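
As a minimal sketch of the subspace idea (function names are illustrative, and the actual methods in [33], [34] may differ in detail), each base classifier is handed its own randomly chosen subset of feature indices and only ever sees the input projected onto that subset:

```python
import random

def random_subspaces(n_features, subspace_size, n_learners, seed=0):
    """Draw one random subset of feature indices per base learner."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_learners)]

def project(x, feature_idx):
    """Restrict an input vector to the features of one view."""
    return [x[i] for i in feature_idx]
```

Each learner is then trained and queried on `project(x, idx)` for its own `idx`, and the outputs are combined by voting.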

The mixture of experts [35] is a divide-and-conquer algorithm that contains a
gating network for softly partitioning the input space and expert networks modeling
each of these partitions. The methodology provides a tool for classification
in which a set of classifiers is mixed according to a final gating mechanism.
Both the classifiers and the gating mechanism are trained at the same time over
a given data set. The mixture of experts can be treated as an RBF network
where the second-layer weights w are outputs of linear models, each taking the
input, and these weights are called experts. Bounds for the VC dimension of the
mixtures-of-experts architecture are derived in [36].

In the Bayesian committee machine [70], the data set is divided into M subsets of
the same size and M models are derived from the individual sets. The predictions
of the individual models are combined using a weighting scheme which is derived
from a Bayesian perspective in the context of Gaussian process regression. That
is, the weight for each individual model is the inverse covariance of its prediction.
Although it can be applied to a combination of any kind of estimators, the main
foci are Gaussian process regression and related systems such as regularization
networks and smoothing splines for which the degrees of freedom increase with
the number of training data. The performance of the Bayesian committee machine
improves if several test points are queried at the same time and is optimal if
the number of test points is at least as large as the degrees of freedom of the
estimator.
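
The inverse-covariance weighting can be illustrated in the scalar case. The sketch below is a deliberate simplification and not the full committee machine of [70]: it combines M Gaussian predictions (mean and variance) for a single test point by plain precision weighting and omits any correction for the prior shared by the M models. The function name is illustrative.

```python
def precision_weighted_combination(means, variances):
    """Combine scalar Gaussian predictions, weighting each by 1/variance.

    Simplified illustration only: a single test point, no shared-prior correction.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one variance per mean")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    combined_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    combined_variance = 1.0 / total_precision
    return combined_mean, combined_variance
```

A confident model (small variance) therefore dominates the combined prediction, while an uncertain one contributes little.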

Stacking is an approach to combining the strengths of a number of fitted 
models. It replaces a simple average by a weighted average, where the weights 
take account of the complexity of the model or other aspects. Stacking is a 
non-Bayesian model averaging, where the estimated weights, corresponding to 
Bayesian priors that downweight complex models, are no longer posterior probabilities
of models; they are obtained by a technique based on cross-validation.
In [10], Bayesian model averaging is compared to stacking. When the correct 
data generating model is on the list of models under consideration, Bayesian 
model averaging is never worse than stacking. Bayesian model averaging is more 
sensitive to model approximation error than stacking is, when the variabilities 
of the random quantities are roughly comparable. It is outperformed by stack- 
ing when the bias exceeds one term of size equal to the leading terms in the 
model or when the direction of deviation has a different functional form (with 
higher variability) that the model list cannot approximate well. Overall, stack- 
ing has better robustness properties than Bayesian model averaging in the most 
important settings. 

Stacked generalization [73] extends voting by combining the base-learners
through a combiner, which is another learner. Stacking estimates and corrects 
the biases of the base-learners. The combiner should be trained on data that are 
not used for training the base-learners. 

Cascading is a multistage method where learner $d_j$ is used only if all preceding
learners $d_k$, $k < j$, are not confident. Associated with each learner is a confidence
$w_j$ such that $d_j$ is confident of its output and can be used if $w_j > \theta_j$, where the
confidence threshold satisfies $1/K < \theta_j < \theta_{j+1} < 1$, where K is the number of
classes. For classification, the confidence function is set to the highest posterior:
$w_j = \max_i d_{ji}$.
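
The cascade rule can be sketched as follows, assuming each learner is a function returning a list of K class posteriors; the function name and the fall-back to the last learner when no stage is confident are illustrative assumptions rather than part of the method as stated.

```python
def cascade_predict(x, learners, thresholds):
    """Return (class index, stage used) for input x.

    learners[j](x) gives the posteriors d_j1..d_jK; thresholds[j] is theta_j.
    """
    posteriors = None
    for stage, (learner, theta) in enumerate(zip(learners, thresholds)):
        posteriors = learner(x)
        confidence = max(posteriors)          # w_j = max_i d_ji
        if confidence > theta:                # learner j is confident: stop here
            return posteriors.index(confidence), stage
    # Assumption: if no stage is confident, use the last learner's best guess.
    return posteriors.index(max(posteriors)), len(learners) - 1
```

Ordering the learners from cheap to costly means that most inputs are settled by the early stages, and only the hard ones reach the later learners.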

Neural networks and SVMs can also be regarded as ensemble methods.
Bayesian methods for nonparametric regression can also be viewed as ensemble 
methods, where a large number of candidate models are averaged with respect 
to the posterior distribution of their parameter settings. A method designed for 
multiclass classification using error-correcting output codes (ECOCs) [15] is a 
learning ensemble. In fact one could characterize any dictionary method, such as 
regression splines, as an ensemble method, with the basis functions serving the 
role of weak learners. A survey of tree-based ensemble methods is given in [16]. 


Aggregation 


Aggregation operators combine data from several sources to improve the quality 
of information. In fuzzy logic, t-norm and t-conorm are two aggregation opera- 
tors. The ordered weighted averaging operator [75] is a well-known aggregation 
operator for multicriteria decision-making. It provides a parameterized family 
of aggregation operators with the maximum, minimum and average as special 
cases. 
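
A minimal sketch of the ordered weighted averaging operator follows (the function name is illustrative). The weights are applied to the arguments after sorting them in descending order, not to particular sources, which is what yields the maximum, minimum and average as special cases.

```python
def owa(values, weights):
    """Ordered weighted average: weights act on the values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("need one weight per value")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))
```

For example, the weight vector (1, 0, 0) gives the maximum, (0, 0, 1) the minimum, and (1/3, 1/3, 1/3) the arithmetic mean of three arguments.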

The base algorithms can be different algorithms, or the same algorithm with
different parameters, or the same algorithm using different features of the same
input, or different base learners trained with different subsets of the training
set, or base-learners trained in a cascade. A main task can also be defined in
terms of a number of subtasks to be implemented by the base-learners, as in the
case of ECOCs. The combination can be implemented in parallel or in multiple stages.
Voting is the simplest way of combining multiple classifiers. It takes a linear
combination of the outputs of learners. The final output is computed by

$$y_i = \sum_{j=1}^{L} w_j d_{ji}, \qquad (20.1)$$

subject to

$$\sum_{j=1}^{L} w_j = 1, \quad w_j \ge 0, \ \forall j, \qquad (20.2)$$

where $d_{ji}$, j = 1,...,L, is the vote of learner j for class $C_i$, and $w_j$ is the weight
of its vote.


When $w_j = 1/L$, we have simple voting. In the case of classification, this is
called plurality voting where the class having the maximum number of votes 
is the winner. For two classes, this is majority voting, where the winner class 
gets more than half of the votes. Voting schemes can be viewed as a Bayesian 
framework with weights as prior model probabilities and model decisions as 
model-conditional likelihoods. 
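
The voting rule (20.1) with the constraint (20.2) can be sketched directly; the function name and the layout `votes[j][i]` for the vote of learner j for class i are illustrative assumptions.

```python
def weighted_vote(votes, weights):
    """Return (y, winner): class scores y_i = sum_j w_j d_ji and the index of the largest."""
    n_learners, n_classes = len(votes), len(votes[0])
    if len(weights) != n_learners:
        raise ValueError("need one weight per learner")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    y = [sum(weights[j] * votes[j][i] for j in range(n_learners))
         for i in range(n_classes)]
    winner = max(range(n_classes), key=lambda i: y[i])
    return y, winner
```

With one-hot votes and equal weights this reduces to plurality voting; unequal weights let a trusted learner outvote several weaker ones.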


Boosting 


Boosting, also known as ARCing (adaptive resampling and combining), was 
introduced in [58] for boosting the performance of any weak learning algorithm, 
i.e., an algorithm that generates classifiers which need only be a little bit better 
than random guessing. Schapire proved that strong and weak PAC learnability
are equivalent [58], which is the theoretical basis for boosting.
Boosting algorithms belong to a class of voting methods that produce a classifier 
as a linear combination of base or weak classifiers. Similar to boosting, multiple 
classifier systems [74] use a group of classifiers to compromise on a given task. 

Boosting works by repeatedly running a given weak learning machine on dif- 
ferent distributions of training examples and combining their outputs. Boosting 
is known as a gradient-descent algorithm over some classes of loss functions [28]. 
Boosting was long believed to seldom overfit, a problem that can arise when the number of
classifiers is large. It often continues to decrease the generalization error long after the
training error becomes zero, as more weak classifiers are added to the linear
combination of classifiers. Some studies have suggested, however, that boosting might
suffer from overfitting [55], [45], especially for noisy datasets.

The original boosting approach, known as boosting by filtering [58], was moti- 
vated by PAC learning theory. It requires a large number of training examples. 
This limitation is overcome by AdaBoost [22], [23]. In boosting by subsampling, 
a fixed sampling size and a set of training examples are used, and they are resam- 
pled according to a given probability distribution during training. In boosting 
by reweighting, all the training examples are used to train the weak learning 
machine, with weights assigned to each example. This technique is applicable
only when the weak learning machine can handle the weighted examples. 
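
The two variants differ only in how the distribution over examples reaches the weak learner. As a minimal sketch (function names are illustrative, not from the cited works), resampling draws a fixed-size training set according to the current weights, whereas reweighting keeps every example and counts its error with its weight:

```python
import random

def resample_by_weight(examples, weights, sample_size, rng):
    """Boosting by subsampling: draw with replacement, proportionally to the weights."""
    return rng.choices(examples, weights=weights, k=sample_size)

def weighted_error(predict, examples, labels, weights):
    """Boosting by reweighting: every example is used, each counted with its weight."""
    total = sum(weights)
    wrong = sum(w for x, y, w in zip(examples, labels, weights) if predict(x) != y)
    return wrong / total
```

Resampling works with any weak learner, at the price of sampling noise; reweighting is exact but needs a learner that accepts example weights.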

For binary classification, the output of a strong classifier H(x) is obtained
from a weighted combination of all the weak hypotheses $h_t(x)$:

$$H(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right), \qquad (20.3)$$

where T is the number of iterations, and $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ is a strong classifier.
In order to minimize the learning error, one must seek to minimize the error of $h_t$ in
each round of boosting, which requires the use of a specific confidence $\alpha_t$.

Upper bounds on the risk of boosted classifiers are obtained, based on the fact 
that boosting tends to maximize the margin of the training examples [59]. Under 
some assumptions on the underlying distribution population, boosting converges 
to the Bayes risk as the number of iterations goes to infinity [8]. 

Boosting performs worse than bagging in the presence of noise [16], and it concentrates
not only on the hard areas, but also on outliers and noise [3]. Boosting
is relatively resistant to overfitting: when run for an arbitrarily
large number of steps it does overfit, though it takes a very long time to do so. In
boosting, unlike in bagging, the committee of weak learners evolves over time,
and the members cast a weighted vote. Boosting appears to dominate bagging on
most problems, and has become the preferred choice. Learn++.NC [48] is a variant
of boosting.


AdaBoost 


The adaptive boosting (AdaBoost) algorithm [22], [23] is a popular approach to 
ensemble learning. Theoretically, AdaBoost can decrease the error of any weak 
learning algorithm. It is a Newton method for optimizing a particular exponential 
loss function [28]. AdaBoost provably achieves arbitrarily good bounds on its 
training and generalization errors provided that weak classifiers can perform 
slightly better than random guessing on every distribution over the training set 
[23]. 

AdaBoost uses the whole data set to train each classifier serially. It adaptively
changes the distribution of the sample depending on how difficult each
example is to classify. After each iteration, the weights of
misclassified instances are increased, while those of correctly classified instances
are decreased, so that the next round gives more focus to the instances that are
harder to classify. Each individual classifier is also weighted according to its overall
accuracy; these weights are then used in the test phase. Finally, when a new
instance is presented, each classifier gives a weighted vote, and the class label is
obtained by weighted majority voting.
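Algorithm 20.1 (AdaBoost).


1. $D_1(i) = 1/N$, i = 1,...,N.
2. for t = 1 to T
   a. Find a weak learner $h_t$ from S, $D_t$:
      $\epsilon_j = \sum_{i: h_j(x_i) \ne y_i} D_t(i)$.
      $h_t = \arg\min_{h_j} \epsilon_j$.
   b. if $\epsilon_t > 0.5$ then
      $T \leftarrow t - 1$.
      return
      end if
   c. $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
   d. $D_{t+1}(i) = D_t(i) e^{-\alpha_t y_i h_t(x_i)}$, i = 1,...,N.
      $D_{t+1}(i) \leftarrow D_{t+1}(i) / \sum_{j=1}^{N} D_{t+1}(j)$, i = 1,...,N.
3. end for
4. Output: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
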


AdaBoost is shown in Algorithm 20.1. Assume that a training set S =
{(x_i, y_i), i = 1,...,N} with $y_i \in \{-1, +1\}$ is a sample of i.i.d. observations
distributed as the random variable (x, y) over an unknown distribution P. $D_t$ is
the data distribution at iteration t. The output is a boosted classifier H(x). Note that in the
updating equation of $D_t$, $-\alpha_t y_i h_t(x_i) < 0$ when $y_i = h_t(x_i)$, and $> 0$ when
$y_i \ne h_t(x_i)$. As a result, after selecting an optimal classifier $h_t$ for $D_t$, the
examples $x_i$ identified correctly by the classifier are weighted less and those
identified incorrectly are weighted more. When testing the classifiers on $D_{t+1}$, it
selects a classifier that better identifies those examples missed by the previous
classifier.
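
The steps of Algorithm 20.1 can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the weak learners are decision stumps found by exhaustive search (the algorithm itself does not prescribe them), the error is clamped away from 0 and 1 to keep the logarithm finite, and all function names are illustrative.

```python
import math

def stump_predict(stump, x):
    """A stump (feature, threshold, polarity) outputs polarity above the threshold."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def best_stump(X, y, D):
    """Step 2a: pick the stump with the smallest weighted error under D."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        values = sorted(set(x[feature] for x in X))
        # Candidate thresholds: below the minimum and midway between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (+1, -1):
                stump = (feature, threshold, polarity)
                err = sum(d for x, label, d in zip(X, y, D)
                          if stump_predict(stump, x) != label)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost(X, y, T):
    """Return a list of (alpha_t, stump_t) pairs; labels y must be -1 or +1."""
    N = len(X)
    D = [1.0 / N] * N                                   # step 1
    ensemble = []
    for _ in range(T):                                  # step 2
        stump, eps = best_stump(X, y, D)                # step 2a
        if eps > 0.5:                                   # step 2b
            break
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)         # assumption: avoid log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)       # step 2c
        ensemble.append((alpha, stump))
        D = [d * math.exp(-alpha * label * stump_predict(stump, x))
             for x, label, d in zip(X, y, D)]           # step 2d
        Z = sum(D)
        D = [d / Z for d in D]                          # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Step 4: sign of the weighted sum of weak hypotheses."""
    f = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if f >= 0 else -1
```

On a one-dimensional sample labelled +, +, -, -, +, + no single stump is correct everywhere, but three boosting rounds suffice to fit all six points, which illustrates how the reweighting steers later stumps toward the examples missed earlier.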

AdaBoost minimizes an exponential function of the margin over the training
set [59]. Given a sample S = {(x_i, y_i), i = 1,...,N}, AdaBoost works to find a
strong classifier $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ that minimizes the convex criterion

$$\frac{1}{N} \sum_{i=1}^{N} e^{-y_i f(x_i)}. \qquad (20.4)$$

AdaBoost allows weak learners to be added until a desired low training
error has been achieved.
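
The value of $\alpha_t$ in step c of Algorithm 20.1 can be recovered from criterion (20.4). The following is the standard argument, sketched under the assumptions that $h_t(x) \in \{-1, +1\}$ and that the earlier coefficients are held fixed, so that $D_t(i)$ is proportional to $e^{-y_i f_{t-1}(x_i)}$ with $f_{t-1}$ the partial sum of the first $t-1$ terms. The part of the criterion that depends on $\alpha$ is then the normalizer $Z_t(\alpha)$:

```latex
\begin{aligned}
Z_t(\alpha) &= \sum_{i=1}^{N} D_t(i)\, e^{-\alpha y_i h_t(x_i)}
             = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}, \\
\frac{dZ_t}{d\alpha} &= -(1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
  \quad\Longrightarrow\quad
  \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}, \\
Z_t(\alpha_t) &= 2\sqrt{\epsilon_t (1-\epsilon_t)} .
\end{aligned}
```

Since $Z_t(\alpha_t) < 1$ whenever $\epsilon_t \ne 1/2$, each round multiplies the exponential criterion by a factor smaller than one, which is why any weak learner that is slightly better than random guessing keeps driving the criterion down.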

It is demonstrated that a simple stopping strategy suffices for universal consistency
[2]: the number of iterations is a fixed function of the sample size. Provided
AdaBoost is stopped after $N^{1-\epsilon}$ iterations, for sample size N and $\epsilon \in (0,1)$, the
sequence of risks, or probabilities of error, of the classifiers it produces approaches
the Bayes risk.

AdaBoost finds a linear separator with a large margin [59]. However, it does not
converge to the maximal margin solution [56]. If the weak learnability assumption
holds, then the data is linearly separable [22]. The exact quantification of the
weak learnability parameter and the $L_1$ margin parameter is well addressed
in [56]. AdaBoost with shrinkage asymptotically converges to an $L_1$-margin-maximizing
solution. AdaBoost* [56] converges to the maximal margin solution in
$O(\log(N)/\epsilon^2)$ iterations. The family of algorithms proposed in [63] has the same
convergence properties. These algorithms are effective when the data is linearly
separable. Weak learnability is shown to be equivalent to linear separability with
$L_1$ margin [63]. A family of relaxations to the weak-learnability assumption that
readily translates to a family of relaxations of linear separability with margin
is described in [63]. Efficient boosting algorithms for maximizing hard and soft
versions of the $L_1$ margin are obtained.

AdaBoost performance can be improved by damping the influences of those 
samples that are hard to learn. This is implemented in BrownBoost [24]. For 
inseparable data sets, the LogLoss Boost algorithm [11] tries to minimize the 
cumulative logistic loss, which is less sensitive to noise. BrownBoost [24] works 
well in the inseparable case and is noise tolerant. It uses the error-function (erf) 
as a margin-based loss function. SmoothBoost [61] builds on the idea of gen- 
erating only smooth distributions by capping the maximal weight of a single 
example. It can tolerate relatively high rates of malicious noise. As a special
case of SmoothBoost, a linear threshold learning algorithm is obtained that matches
the sample complexity and malicious noise tolerance of the online perceptron
algorithm.

For noisy data, overfitting effects can be avoided by regularizing boosting so as 
to limit the complexity of the function class. AdaBoostReg [55] and BrownBoost 
[24] are designed for this purpose. AdaBoost is highly affected by outliers. Using 
loss functions for robust boosting, the robust eta-boost algorithm [37] is robust 
against both mislabels and outliers, especially for the estimation of conditional 
probability. 

AdaBoost.M1 and AdaBoost.M2 [23] extend AdaBoost from binary classification
to the multiclass case. AdaBoost.M1, as a straightforward generalization,
halts if the classification error rate of the weak classifier produced in any iterative
step exceeds 50% [59]. To avoid this problem, AdaBoost.M2 attempts to minimize
a more sophisticated error measure called pseudoloss. The boosting process
continues as long as the weak classifier produced has pseudoloss slightly better
than random guessing. In addition, the introduction of the mislabel distribution
enhances the communication between the learner and the booster. AdaBoost.M2
can focus the learner not only on hard-to-classify examples, but also on the incorrect
labels. AdaBoost.MH is an extension of AdaBoost to multi-class/multi-label
classification problems [60]. AdaBoost.R [23] further extends AdaBoost.M2 to 
boosting regression problems. It solves regression problems by reducing them to 
classification ones. AdaBoost.RT [65] is a boosting algorithm for regression prob- 
lems. It filters out the examples with the relative estimation error that is higher 
than the preset threshold value, and then follows the AdaBoost procedure. 
FloatBoost [44] learns a boosted classifier for achieving the minimum error
rate. FloatBoost learning uses a backtrack mechanism after each iteration of 
AdaBoost learning to minimize the error rate directly, rather than minimizing 
an exponential function of the margin as in AdaBoost. A stagewise approxi- 
mation of the posterior probability is used for learning best weak classifiers. 
These techniques lead to a classifier which requires fewer weak classifiers than 
AdaBoost, yet achieves lower error rates in both training and testing. The per- 
formance improvement brought about by FloatBoost is achieved at the cost of 
longer training time [44]. MultBoost [30] is a parallel variant of AdaBoost that 
can achieve parallelization both in space and time. Unlike AdaBoost, LogitBoost
[28] is based on an additive logistic regression model.

Conventional AdaBoost selects weak learners by merely minimizing the training
error rate, that is, the ratio of the number of misclassified samples to the total
number of samples. Each misclassified sample is weighted equally, so the training
error can hardly represent the degree of error. The training-error-based
criterion therefore does not work well in some cases, especially for small-sample-size
problems, since more than one weak learner may give the same training error. Two
key problems for AdaBoost algorithms are how to select the most discriminative 
weak learners and how to optimally combine them. To deal with these prob- 
lems, error-degree-weighted training error [31] is defined based on error degree, 
which is related to the distances from the samples to the separating hyperplane. 
The most discriminative weak learners are first selected by these criteria; after
the coefficients are set empirically, the weak learners are optimally
combined by tuning the coefficients using a kernel-based perceptron.


Bagging 


Bagging, short for Bootstrap AGGregatING, works by training each classifier on a 
bootstrap sample [5]. The essential idea in bagging is to average many noisy but 
approximately unbiased models, hence reducing the prediction variance without 
affecting the prediction bias. It seems to work especially well for high-variance, 
low-bias procedures, such as trees. The effect of bagging on bias is uncertain, 
as a number of contradictory findings have been reported. The performance of 
bagging is generally worse than that of boosting. 

Trees are ideal candidates for bagging, since they can capture complex interac- 
tion structures in the data and have relatively low bias if grown sufficiently deep. 
Since each tree generated in bagging is identically distributed, the expectation 
of an average of B such trees is the same as the expectation of any one of them. 
This means the bias of bagged trees is the same as that of the individual trees. 
This is in contrast to boosting, where the trees are grown in an adaptive way to 
remove bias [26], and hence are not identically distributed. 

Bagging enhances the performance of a predictor by repeatedly evaluating the 
predictor on bootstrap samples and then forming an average over those samples. 
Bagging works well for unstable modeling procedures, i.e., those for which the
conclusions are sensitive to small changes in the data [5]. Bagging is based on
bootstrap samples of the same size as the training set. Each bootstrap sample
is created by uniformly sampling instances from the whole training set with 
replacement, thus some examples may appear more than once, while others may 
not appear at all. Bagging then constructs a new learner from each, and averages 
the predictions. In boosting, predictions are averaged with different weights. 
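
The bagging procedure can be sketched as follows. The base learner here is a depth-1 regression tree on scalar inputs, chosen only as a small example of an unstable procedure; it and all function names are illustrative assumptions.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def fit_stump(sample):
    """Depth-1 regression tree on (x, y) pairs: the single split with least squared error."""
    pts = sorted(sample)
    best = None
    for k in range(1, len(pts)):
        if pts[k - 1][0] == pts[k][0]:
            continue                      # cannot split between equal x values
        left = [y for _, y in pts[:k]]
        right = [y for _, y in pts[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pts[k - 1][0] + pts[k][0]) / 2.0, ml, mr)
    if best is None:                      # all x equal: predict the mean
        mean_y = sum(y for _, y in pts) / len(pts)
        return lambda x: mean_y
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr

def bagging(data, fit, n_bags, seed=0):
    """Fit one model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(data, rng)) for _ in range(n_bags)]
    def predict(x):
        return sum(model(x) for model in models) / len(models)
    return predict
```

A single stump jumps abruptly at its split point, and the split moves from one bootstrap sample to the next; averaging the stumps smooths the jump, which is the variance reduction described above.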

Bagging and boosting are non-Bayesian procedures that have some similarity 
to MCMC in a Bayesian model. The Bayesian approach fixes the data and perturbs
the parameters, according to the current estimate of the posterior distribution.
Bagging perturbs the data in an i.i.d. fashion and then re-estimates the model
to give a new set of model parameters. Finally, a simple average of the model 
predictions from different bagged samples is computed. Boosting is similar to 
bagging, but fits a model that is additive in the models of each individual base- 
learner, which are learned using non-i.i.d. samples. We can write all of these 
models in the form 

$$\hat{f}(x_{\rm new}) = \sum_{l=1}^{L} w_l \, E(y_{\rm new} | x_{\rm new}, \hat{\theta}_l), \qquad (20.5)$$


where the $\hat{\theta}_l$ are a large collection of model parameters. For the Bayesian model, $w_l = 1/L$,
and the average estimates the posterior mean (20.5) by sampling $\theta_l$ from
the posterior distribution. For bagging, $w_l = 1/L$ as well, and $\hat{\theta}_l$ corresponds to
the parameters refit to bootstrap resamples of the training data. For boosting,
$w_l = 1$, but $\hat{\theta}_l$ is typically chosen in a nonrandom sequential fashion to constantly
improve the fit.

Online bagging [49] implements bagging sequentially. It asymptotically approximates
the results of batch bagging, but it is not guaranteed to produce the same
results as batch bagging. A variation is to replace the ordinary bootstrap with
the Bayesian bootstrap. The online Bayesian version of the bagging algorithm [43]
is exactly equivalent to its batch Bayesian counterpart. The Bayesian approach
produces a completely lossless bagging algorithm. It can lead to increased accuracy
and decreased prediction variance for smaller data sets.


Random forests 


A forest is a graph where all its connected components are trees. Random forests 
are an ensemble method that extends the idea of bagging trees. In the random 
forest method, multiple decision trees are systematically generated by randomly 
selecting subsets of feature spaces [33] or subsets of training instances [7]. The 
rotation forest method [57] uses K-axis rotations to form new features to train 
multiple classifiers.

A random forest [7] is a collection of identically distributed classification trees,
each of which is constructed by some partitioning rule. It is formed by taking 
bootstrap samples from the training set. The idea in random forests is to improve 
the variance reduction of bagging by reducing the correlation between the trees, 
without introducing a significant increase in the bias. This is achieved in the tree- 
growing process through random selection of the input variables. As in bagging, 
the bias of a random forest is the same as the bias of any of the individual 
sampled trees. Hence the improvements in prediction obtained by bagging or 
random forests are solely a result of variance reduction. Above a certain number 
of trees, adding more trees does not improve the performance [7]. 
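
The role of the correlation between trees can be made explicit with a standard identity. It is given here as a sketch under the assumption that the B trees $T_b(\boldsymbol{x})$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$; the symbols $T_b$, $\sigma^2$ and $\rho$ are introduced here for illustration only.

```latex
\mathrm{Var}\!\left( \frac{1}{B} \sum_{b=1}^{B} T_b(\boldsymbol{x}) \right)
  = \frac{1}{B^2} \left[ B \sigma^2 + B(B-1) \rho \sigma^2 \right]
  = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2
```

The second term vanishes as B grows, which is consistent with the observation that adding trees eventually stops helping. The first term can only be lowered by reducing $\rho$, which is what the random selection of input variables aims at.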

For each bootstrap sample, a classification tree is formed, and there is no
pruning: the tree grows until all terminal nodes are pure. After the trees are
grown, a new case is dropped down each of the trees. When used for
classification, a random forest obtains a class vote from each tree and assigns
the class that receives the majority vote. When used for regression, the
predictions from the individual trees are simply averaged. As a classifier, the
random forest is fully competitive with SVM. It generates an internal unbiased
estimate of the generalization error. It handles missing data very well, and can
maintain high levels of accuracy. It also provides estimates of the relative
importance of each of the covariates in the classification rule.

For random forests, misclassification error is less sensitive to variance than 
MSE is. It has often been observed that boosting, like random forests, does not 
overfit, or is slow to overfit. The random forest classifier is in fact a weighted 
version of the k-NN classifier. On many problems the performance of random
forests is very similar to that of boosting, and they are simpler to train and tune.
As a consequence, the random forest method is popular, and is implemented in 
a variety of packages. 


Topics in ensemble learning 


Ensemble neural networks 

Boosting is used to combine a large number of SVMs, each trained on only a small 
data subsample [50]. Other parallel approaches to SVMs split the training data 
into subsets and distribute them among the processors. For a parallel mixture of 
SVMs [12], the model first trains many SVMs on small subsets and then combines 
their outputs using a gater such as a linear hyperplane or an MLP. The training time
complexity can be driven down to O(N). Surprisingly, this leads to a significant
improvement in generalization [12].

An extended experimental analysis of bias-variance decomposition of the error 
in SVM is presented in [71], considering Gaussian, polynomial and dot product 
kernels. The bias-variance decomposition offers a rationale to develop ensem- 
ble methods using SVMs as base learners. The characterization of bias-variance 
decomposition of error for single SVMs also holds for bagged and random aggregating
ensembles of SVMs: the main characteristics are maintained, with an
overall reduction of the variance component [71]. 

A mixture model of linear SVMs [29] exploits a divide-and-conquer strategy 
by partitioning the feature space into subregions of linearly separable data points 
and learning a linear SVM for each of these regions. One can impose priors on 
the mixing coefficients and do implicit model selection in a top-down manner 
during the parameter estimation process. This guarantees the sparsity of the 
learned model. This is done by using the EM algorithm on a generative model, 
which permits the use of priors on the mixing coefficients. 

Ensemble clustering aggregates multiple clustering solutions into one solution
that maximizes the agreement in the input ensemble [66]. A cluster ensemble
is a more accurate alternative to individual clustering algorithms. The cluster
ensembles considered in [42] are based on C-means clusterers. Each clusterer is
assigned a random target number of clusters, k, and is started from a random
initialization. Vector quantization methods based on bagging and AdaBoost [64]
can achieve a good performance in shorter learning times than conventional ones 
such as C-means and neural gas. Exact bagging of k-NN learners extends exact 
bagging methods from the conventional bootstrap sampling to bootstrap sub- 
sampling schemes [68]. 

For online learning of recurrent networks, the RTRL algorithm takes $O(n^4)$
computations for n neurons. Although EKF offers superior convergence properties
to gradient descent, the computational complexity per time step for EKF is
equivalent to that of RTRL, and it also depends on RTRL derivatives. Through a
sequential Bayesian filtering framework, the ensemble Kalman filter [47] is an MCMC
method for estimating the time evolution of the state distribution, along with an
efficient algorithm for updating the state ensemble whenever a new measurement
arrives. It avoids the computation of the derivatives. The ensemble Kalman filter
has superior convergence properties to gradient-descent learning and EKF
filtering. It reduces the computational complexity to $O(n^2)$.


Diversity versus ensemble accuracy 

The ensembles generated by existing techniques are sometimes unnecessarily 
large. The purpose of ensemble pruning is to search for a good subset of ensem- 
ble members that performs as well as, or better than, the original ensemble. A 
straightforward pruning method is to rank the classifiers according to their indi- 
vidual performance on a held-out test set and pick the best ones. This simple 
approach may work well but is theoretically unsound. For example, an ensemble
of three identical classifiers with 90% accuracy is worse than an ensemble of three
classifiers with 67% accuracy and the least pairwise-correlated errors. Ensemble
pruning can be viewed as a discrete version of the weight-based ensemble optimization
problem. It can be formulated as a quadratic integer programming problem
that looks for a subset of classifiers with the optimal accuracy-diversity tradeoff,
and SDP can be applied as a good approximate solution technique [76].
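
The 90% versus 67% comparison can be checked numerically. The sketch below assumes a two-class problem, so that a majority of correct votes yields a correct ensemble decision. The function name and the toy correctness patterns are illustrative assumptions, not data from the cited work.

```python
def majority_vote_accuracy(correct_flags):
    """correct_flags[c][s] is True when classifier c labels sample s correctly."""
    n_classifiers = len(correct_flags)
    n_samples = len(correct_flags[0])
    hits = 0
    for s in range(n_samples):
        votes = sum(correct_flags[c][s] for c in range(n_classifiers))
        if 2 * votes > n_classifiers:  # strict majority is correct
            hits += 1
    return hits / n_samples


# Three identical classifiers, each correct on 9 of 10 samples.
identical = [[True] * 9 + [False]] * 3

# Three classifiers, each correct on 4 of 6 samples (about 67%),
# with errors falling on disjoint pairs of samples.
diverse = [[s // 2 != c for s in range(6)] for c in range(3)]

if __name__ == "__main__":
    print(majority_vote_accuracy(identical))
    print(majority_vote_accuracy(diverse))
```

In the diverse ensemble every sample is misclassified by exactly one member, so the other two always outvote it. The identical ensemble simply repeats the single classifier's errors.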

It is commonly accepted that large diversity between classifiers in a team is
preferred. The most widely used diversity measure is the Q-statistic [46]. Two
statistically independent classifiers will have Q = 0. Q varies between -1 and
1; the lower the value, the more diverse the classifiers. Classifiers that tend to
recognize the same objects correctly will have positive values of Q, and those
that commit errors on different objects will make Q negative.
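
As an illustration, the Q-statistic for a pair of classifiers can be computed from the counts of samples on which both are correct ($N^{11}$), both are wrong ($N^{00}$), or exactly one is correct ($N^{10}$, $N^{01}$), using the usual form $Q = (N^{11}N^{00} - N^{01}N^{10}) / (N^{11}N^{00} + N^{01}N^{10})$. The function name and the toy inputs below are assumptions made for this sketch.

```python
def q_statistic(correct_a, correct_b):
    """Pairwise Q-statistic from per-sample correctness flags of two classifiers."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined when N11*N00 + N01*N10 is zero")
    return (n11 * n00 - n01 * n10) / denominator


if __name__ == "__main__":
    a = [True, True, False, False]
    print(q_statistic(a, [True, True, False, False]))   # same successes and failures
    print(q_statistic(a, [True, False, True, False]))   # unrelated pattern
    print(q_statistic(a, [False, False, True, True]))   # errors on different objects
```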

Empirical studies show that the relationship between diversity measures and 
ensemble accuracy is somewhat confusing [41]. Theoretical insights show that 
the diversity measures are in general ineffective [69]. It has been proved that 
using diversity measures usually produces ensembles with large diversity, but 
not maximum diversity [69]. 

The diversity between classifiers and the individual accuracies of the classifiers 
clearly influence the performances of an ensemble of classifiers. An information- 
theoretic score is proposed in [46] to express a tradeoff between individual accu- 
racy and diversity. This technique can be directly used for selecting an optimal 
ensemble in a pool of classifiers. In the context of overproduction and selection of 
classifiers, the information-theoretic score-based selection outperforms diversity- 
based selection techniques. 


Theoretical analysis 

The improved generalization capabilities of ensembles of learning machines can 
be interpreted in the framework of large margin classifiers [1], in the context of 
stochastic discrimination theory [39], and in the light of bias-variance analysis 
[6], [27]. Ensembles enlarge the margins, enhancing the generalization capabilities 
of learning algorithms [59], [1]. Ensembles can reduce variance [6] and also bias 
[40]. 

Historically, the bias-variance insight uses squared-loss as the loss function. 
For classification problems, where the 0/1 loss is the main criterion, bias-variance 
decompositions related to the 0/1 loss have been proposed in [6], [25], [17]. For 
classification problems, the 0/1 loss function in a unified framework of bias- 
variance decomposition of the error is considered in [17]. Bias and variance are 
defined for an arbitrary loss function. Based on the unified bias-variance the- 
ory [17], methods and procedures have been proposed in [72] to evaluate and 
quantitatively measure the bias-variance decomposition of error in ensembles 
of learning machines. A bias-variance decomposition in the context of ECOC 
ensembles is described in [40].

The notion of margins [59] can be expressed in terms of bias and variance 
and vice versa [17], showing the equivalence of margin-based and bias-variance 
based approaches. Bias and variance are not purely additive: Certain types of 
bias can be canceled by low variance to produce accurate classification [25]. This 
can dramatically mitigate the effect of the bias associated with some simple esti- 
mators like naive Bayes, and the bias induced by the curse of dimensionality on 
nearest-neighbor procedures. This explains why such simple methods are often 
competitive with and sometimes superior to more sophisticated ones for
classification, and why bagging/aggregating classifiers can often improve accuracy
[25].

Figure 20.1 Unclassifiable regions by the one-against-all formulation.


Solving multiclass classification 


A common way to model multiclass classification problems is to design a set of 
binary classifiers and then to combine them. ECOCs are a general framework to 
combine binary problems to address the multiclass problem [15]. 


One-against-all strategy 


The one-against-all strategy for K (K > 2) classes is the simplest and most frequently
used one. Each class is trained against the remaining K - 1 classes that have been
collected together. For the ith two-class problem, the original K-class training 
data are labeled as belonging to or not belonging to class i and are used for 
training. Thus, a total of K binary classifiers are required. Each classifier needs 
to be trained on the whole training set, and there is no guarantee that good 
discrimination exists between one class and the remaining classes. This method 
also results in imbalanced data learning problems. 

We determine K direct decision functions that separate one class from the
remaining classes. Let the ith decision function, with the maximum margin that
separates class i from the remaining classes, be $D_i(\boldsymbol{x})$. On the boundary, $D_i(\boldsymbol{x}) = 0$.
To avoid the unclassifiable region, shown as the shaded region in Fig. 20.1, data
sample $\boldsymbol{x}$ is classified into the class $\arg\max_{i=1,\ldots,K} D_i(\boldsymbol{x})$.
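
The difference between the sign rule and the arg max rule can be seen in a small sketch. The three linear decision functions below are hypothetical stand-ins for trained maximum-margin classifiers, and classes are indexed from 0 for convenience.

```python
def positive_classes(decision_functions, x):
    """Classes whose decision function is positive at x (the sign rule)."""
    return [i for i, d in enumerate(decision_functions) if d(x) > 0]


def one_against_all_predict(decision_functions, x):
    """Assign x to the class with the largest decision value (the arg max rule)."""
    scores = [d(x) for d in decision_functions]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical decision functions D_0, D_1, D_2 on two-dimensional inputs.
decision_functions = [
    lambda x: x[0] - 1.0,
    lambda x: x[1] - 1.0,
    lambda x: 1.0 - x[0] - x[1],
]

if __name__ == "__main__":
    x = (0.9, 0.8)
    print(positive_classes(decision_functions, x))         # no class claims x
    print(one_against_all_predict(decision_functions, x))  # arg max still decides
```

At the point (0.9, 0.8) all three decision values are negative, so the sign rule leaves the point unclassified, whereas the arg max rule assigns it to the class whose boundary is closest in decision value.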


One-against-one strategy 


The one-against-one (pairwise voting) strategy reduces the unclassifiable regions that
occur for the one-against-all strategy, but unclassifiable regions still exist. In the
one-against-one strategy, we determine the decision functions for all the combinations
of class pairs. When determining a decision function for a class pair, we use the
training data for the corresponding two classes. Thus, in each training session,
the number of training data is reduced considerably compared to the one-against-all
strategy, which uses all the training data. The one-against-one strategy needs to train
K(K - 1)/2 binary classifiers, compared to K for the one-against-all strategy, with
each classifier separating a pair of classes. The outputs of K(K - 1)/2 binary
tests are required to make a final decision by majority voting. This approach
is prohibitive for large K.

Figure 20.2 Unclassifiable regions by the one-against-one formulation.


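Let the decision function for class i against class j, with the maximum margin,
be $D_{ij}(\boldsymbol{x})$. We have $D_{ij}(\boldsymbol{x}) = -D_{ji}(\boldsymbol{x})$. The regions

$$R_i = \{\boldsymbol{x} \mid D_{ij}(\boldsymbol{x}) > 0,\ j = 1, \ldots, K,\ j \neq i\}, \quad i = 1, \ldots, K \tag{20.6}$$

do not overlap. If $\boldsymbol{x} \in R_i$, $\boldsymbol{x}$ is considered to belong to class i. The problem that
$\boldsymbol{x}$ may not be in any of $R_i$ may occur. Therefore, we classify $\boldsymbol{x}$ by voting. By
calculating

$$D_i(\boldsymbol{x}) = \sum_{j=1,\, j \neq i}^{K} \mathrm{sign}(D_{ij}(\boldsymbol{x})), \tag{20.7}$$

we classify $\boldsymbol{x}$ into class $k = \arg\max_{i=1,\ldots,K} D_i(\boldsymbol{x})$.

If $\boldsymbol{x} \in R_i$, $D_i(\boldsymbol{x}) = K - 1$ and $D_j(\boldsymbol{x}) < K - 1$ for $j \neq i$. In this case, $\boldsymbol{x}$ is
classified into class i. If none of the $D_i(\boldsymbol{x})$ equals $K - 1$, k may have multiple values. In this case, $\boldsymbol{x}$
is unclassifiable. In Fig. 20.2, the shaded region is unclassifiable, but it is much
smaller than that for the one-against-all case.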
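
Pairwise voting can be sketched as follows. The code counts the pairwise wins of each class, which selects the same winner as summing signs, and reports a tie as unclassifiable. The three pairwise decision functions are hypothetical, and classes are indexed from 0.

```python
def pairwise_votes(pairwise, n_classes, x):
    """pairwise[(i, j)] with i < j returns D_ij(x); D_ji(x) = -D_ij(x) is implied."""
    votes = [0] * n_classes
    for (i, j), d_ij in pairwise.items():
        value = d_ij(x)
        if value > 0:
            votes[i] += 1
        elif value < 0:
            votes[j] += 1
    return votes


def one_against_one_predict(pairwise, n_classes, x):
    """Return the unique vote winner, or None when x is unclassifiable."""
    votes = pairwise_votes(pairwise, n_classes, x)
    best = max(votes)
    winners = [i for i, v in enumerate(votes) if v == best]
    return winners[0] if len(winners) == 1 else None


# Hypothetical pairwise decision functions for K = 3 classes.
pairwise = {
    (0, 1): lambda x: x[0],
    (0, 2): lambda x: x[1],
    (1, 2): lambda x: x[0] - x[1],
}

if __name__ == "__main__":
    print(one_against_one_predict(pairwise, 3, (1.0, 2.0)))   # class 0 wins both its duels
    print(one_against_one_predict(pairwise, 3, (1.0, -1.0)))  # cyclic outcome: no winner
```

At (1, -1) class 0 beats class 1, class 1 beats class 2, and class 2 beats class 0, so every class collects one vote and the point falls in the unclassifiable region.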


Similar to the one-against-all formulation, the membership function is introduced
to resolve unclassifiable regions while realizing the same classification
results as those of the conventional one-against-one classification for the
classifiable regions. The all-and-one approach [51] is based on the combination of the
one-against-all and one-against-one methods and partially avoids their respective
sources of failure.


A modification to one-against-one is made in the directed acyclic graph SVM
(DAGSVM) [52]. The training phase is the same as that of the one-against-one
method; in the testing phase, however, it uses a rooted binary directed acyclic
graph with K(K - 1)/2 internal nodes and K leaves. Only K - 1 binary classifiers
need to be evaluated during the testing phase.
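
One common way to traverse such a graph is to keep a list of candidate classes and let each pairwise test eliminate one of them. The sketch below follows that idea and is not the implementation of [52]. The pairwise decision functions are hypothetical, and classes are indexed from 0.

```python
def dag_predict(pairwise, n_classes, x):
    """Eliminate one candidate class per pairwise test; return (class, tests used).

    pairwise[(i, j)] with i < j returns D_ij(x); a positive value favors class i.
    """
    candidates = list(range(n_classes))
    evaluations = 0
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        evaluations += 1
        if pairwise[(i, j)](x) > 0:
            candidates.pop()       # class j is eliminated
        else:
            candidates.pop(0)      # class i is eliminated
    return candidates[0], evaluations


# Hypothetical pairwise decision functions for K = 3 classes.
pairwise = {
    (0, 1): lambda x: x[0],
    (0, 2): lambda x: x[1],
    (1, 2): lambda x: x[0] - x[1],
}

if __name__ == "__main__":
    print(dag_predict(pairwise, 3, (1.0, 2.0)))
    print(dag_predict(pairwise, 3, (1.0, -1.0)))
```

Each test removes exactly one candidate, so K - 1 evaluations always suffice. Unlike plain voting, the traversal returns a class even where the pairwise outcomes are cyclic, although the result there depends on the order in which the classes are listed.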

A one-vs.-one-vs.-rest scheme is implemented in multi-class smooth SVM [9]
that decomposes the problem into K(K - 1)/2 ternary classification subproblems
based on the assumption of ternary voting games. The approach is reported to
outperform the one-against-one and one-against-rest schemes on all the datasets tested [9].


Error-correcting output codes (ECOCs) 


In the ECOC framework, at the coding step, a set of binary problems is defined 
based on the learning of different sub-partitions of classes by means of a base 
classifier. Then, each of the partitions is embedded as a column of a coding 
matrix Q. The rows of Q correspond to the codewords codifying each class. At 
the decoding step, a new data sample that arrives in the system is tested, and a 
codeword is formed as a result of the output of the binary problems. For a test 
sample, it looks for the most similar class codeword by using a distance metric 
such as the Hamming or the Euclidean distance. If the minimum Hamming
distance between any pair of codewords is t, up to $\lfloor (t-1)/2 \rfloor$ single-bit errors
in the output codeword can be corrected.
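
The decoding step and the error-correcting bound can be illustrated with a small hand-made code. The 4 x 5 coding matrix below is a hypothetical example chosen so that its minimum pairwise Hamming distance is 3. It is not one of the designs discussed in the text, and classes are indexed from 0.

```python
from itertools import combinations


def hamming_distance(u, v):
    """Number of positions at which two codewords differ."""
    return sum(a != b for a, b in zip(u, v))


def min_pairwise_distance(coding_matrix):
    """Minimum Hamming distance t over all pairs of class codewords (rows)."""
    return min(hamming_distance(u, v) for u, v in combinations(coding_matrix, 2))


def ecoc_decode(coding_matrix, outputs):
    """Return the index of the class codeword closest to the classifier outputs."""
    distances = [hamming_distance(row, outputs) for row in coding_matrix]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical coding matrix: 4 classes (rows), 5 dichotomizers (columns).
Q = [
    [-1, -1, -1, -1, -1],
    [-1, -1, +1, +1, +1],
    [+1, +1, -1, +1, +1],
    [+1, +1, +1, -1, -1],
]

if __name__ == "__main__":
    t = min_pairwise_distance(Q)
    print(t, (t - 1) // 2)           # minimum distance and correctable errors
    corrupted = [+1, +1, -1, +1, -1]  # codeword of class 2 with its last bit flipped
    print(ecoc_decode(Q, corrupted))
```

With t = 3 the bound gives one correctable error, and flipping any single output bit of any class codeword still decodes to the original class.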

Unlike the voting procedure, the information provided by the ECOC 
dichotomizers is shared among classes in order to obtain a precise classifica- 
tion decision, being able to reduce errors caused by the variance and the bias 
produced by the learners [40]. 

In the binary ECOC framework, all positions of the coding matrix Q belong
to the set {+1, -1}. Each dichotomizer thus treats every class as a member of
one of the two partitions of classes that define its binary problem. In this
case, the standard binary coding designs are the one-against-all
strategy and the dense random strategy [15], which require $N$ and $10 \log_2 N$
dichotomizers, respectively, for $N$ classes [1].

In ECOCs, the main classification task is divided into a number of subtasks 
that are implemented by base-learners. We then combine the simpler classifiers 
and get the final classification result. Base-learners are binary classifiers with
output -1/+1, and we have a $K \times L$ coding matrix Q, with K rows of binary
codes of classes and L base-learners $d_j$. ECOC can be given by a voting scheme
where the entries $q_{ij}$ are vote weights:

$$y_i = \sum_{j=1}^{L} q_{ij} d_j, \tag{20.8}$$

and we choose the class with the highest $y_i$.
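
For a binary coding matrix, the voting scheme and Hamming decoding select the same class. As a short sketch, read the vote total as $y_i = \sum_j q_{ij} d_j$ with $q_{ij}, d_j \in \{+1, -1\}$, and write $d_H(\boldsymbol{q}_i, \boldsymbol{d})$ for the Hamming distance between the codeword of class i and the vector of base-learner outputs. The symbols $d_H$, $\boldsymbol{q}_i$ and $\boldsymbol{d}$ are notation introduced here. Each agreeing position contributes +1 and each disagreeing position contributes -1, so

```latex
y_i = \sum_{j=1}^{L} q_{ij} d_j
    = \bigl( L - d_H(\boldsymbol{q}_i, \boldsymbol{d}) \bigr) - d_H(\boldsymbol{q}_i, \boldsymbol{d})
    = L - 2\, d_H(\boldsymbol{q}_i, \boldsymbol{d})
```

Maximizing $y_i$ is therefore the same as minimizing the Hamming distance. This equivalence does not carry over unchanged to ternary codes, where zero entries contribute nothing to the sum.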

In ternary ECOCs [1], the positions of the coding matrix Q can be +1, —1 
or 0. The zero symbol means that a given class is not considered in the learning 
process of a particular dichotomizer. The ternary framework contains a larger 
set of binary problems. The huge set of possible bi-partitions of classes from the 
ternary ECOC framework has suggested the use of problem-dependent designs 
as well as new decoding strategies [20], [19], [53]. The coding designs in the 
same ternary ECOC framework are the one-against-one strategy [32], the one-versus-all
strategy and the sparse random strategy [1]. The one-against-one strategy considers all
possible pairs of classes. Thus, its codeword length is K(K - 1)/2, corresponding
to K(K - 1)/2 binary classification problems. The sparse random strategy is
similar to the dense random design, but it includes the third symbol 0 with
another probability, and a sparse code of length $15 \log_2 K$ has been suggested
in [1]. The discriminant ECOC approach [53] requires N - 1 classifiers, where
the N - 1 nodes of a binary tree structure are codified as dichotomizers of a ternary
problem-dependent ECOC design.

In the ECOC framework, one-against-one coding tends to achieve higher performance
than other coding designs in real multiclass problems [20]. A high percentage
of the positions of its coding matrix are coded by zero, which implies
a high degree of sparseness. The zero symbol introduces two kinds of biases that
require redefinition of the decoding design [20]. A type of decoding measure and
two decoding strategies are defined. These decoding strategies avoid the bias
produced by the zero symbol, and all the codewords work in the same dynamic range;
thus significant performance improvement is obtained on the ECOC designs [20].
A general extension of the ECOC framework to the online learning scenario is
given in [21]. The final classifier handles the addition of new classes independently
of the base classifier used. The online ECOC methods tend toward the results of
the batch approach, and they provide a feasible and robust way for handling new
classes using any base classifier.

In ECOC strategy, ECOCs are employed to improve the decision accuracy. 
Although the codewords generated by ECOCs have good error-correcting capa- 
bilities, some subproblems generated may be difficult to learn [1]. The simpler 
one-against-all and one-against-one strategies can present results comparable or 
superior to those produced by ECOC strategy in several applications. ECOC 
strategy at least needs K times the number of tests. A MATLAB ECOC library 
[20] contains both state-of-the-art coding and decoding designs. 

The combination of several binary classifiers in the ECOC strategy is typically done via
a simple nearest-neighbor rule that finds the class that is closest to the outputs
of the binary classifiers. For these nearest-neighbor ECOCs, existing bounds on
the error rate of the multiclass classifier are improved given the average binary
distance [38]. The results show why elimination and Hamming decoding
often achieve the same accuracy. In addition to generalization improvement,
ECOCs can be used to resolve unclassifiable regions. 

For ECOCs, let $g_{ij}$ be the target value of $D_j(\boldsymbol{x})$, the jth decision function, for
class i:

$$g_{ij} = \begin{cases} 1, & \text{if } D_j(\boldsymbol{x}) > 0 \text{ for class } i, \\ -1, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, K. \tag{20.9}$$

The jth column vector $\boldsymbol{g}_j = (g_{1j}, \ldots, g_{Kj})^T$ is the target vector for the jth decision
function. If all the elements of a column are 1 or -1, classification is not
performed by this decision function, and two column vectors with $\boldsymbol{g}_i = -\boldsymbol{g}_j$ result
in the same decision function. Thus the maximum number of distinct decision
functions is $2^{K-1} - 1$.
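
The count can be verified by brute-force enumeration: list all target columns in {+1, -1}^K, drop the two constant columns, and identify each column with its negation. The function name and the canonical-form trick (keeping the lexicographically larger of a column and its negation) are choices made for this sketch.

```python
from itertools import product


def distinct_decision_functions(n_classes):
    """Enumerate target columns that define distinct, non-trivial decision functions."""
    seen = set()
    for column in product((1, -1), repeat=n_classes):
        if len(set(column)) == 1:
            continue  # all +1 or all -1: no classification is performed
        negated = tuple(-g for g in column)
        seen.add(max(column, negated))  # a column and its negation count once
    return sorted(seen, reverse=True)


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, len(distinct_decision_functions(k)), 2 ** (k - 1) - 1)
    print(distinct_decision_functions(3))
```

For three classes the enumeration leaves three columns, each separating one class from the other two up to a sign flip.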

The ith row vector $(g_{i1}, \ldots, g_{ik})$ corresponds to a codeword for class i, where
k is the number of decision functions. In error-correcting codes, if the minimum
Hamming distance between pairs of codewords is t, the code can correct at least
$\lfloor (t-1)/2 \rfloor$-bit errors, where $\lfloor a \rfloor$ gives the maximum integer value that does not
exceed a. For three-class problems, there are at most three decision functions,

$$[g_{ij}] = \begin{pmatrix} 1 & -1 & -1 \\ -1 & 1 & -1 \\ -1 & -1 & 1 \end{pmatrix},$$

which is equivalent to the one-against-all formulation, and
there is no error-correcting capability. By introducing don’t-care outputs 0, the
one-against-all, one-against-one and ECOC schemes are unified [1]. One-against-one
classification for three classes can be shown as

$$[g_{ij}] = \begin{pmatrix} 1 & 0 & -1 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{pmatrix}.$$

Dempster-Shafer theory of evidence 


The Dempster-Shafer theory of evidence, which originates from the upper and 
lower probabilities, was first proposed by Dempster [13] and then further devel- 
oped by Shafer [62]. It can be viewed as a generalization of the Bayesian proba- 
bility calculus for combining probability judgements based on different bodies of 
evidence. The Dempster-Shafer method combines evidence regarding the truth of 
a hypothesis from different sources. It is the most frequently used fusion method 
at the decision-making level. The Dempster-Shafer method uses belief repre- 
senting the extent to which the evidence supports a hypothesis and plausibility 
representing the extent to which the evidence fails to refute the hypothesis. These 
resemble necessity and possibility in fuzzy logic. 


Definition 20.1 (Frame of discernment). A frame of discernment $\Theta$ is
defined as a finite and exhaustive set of N mutually exclusive singleton hypotheses
that constitutes a source’s scope of expertise

$$\Theta = \{A_1, A_2, \ldots, A_N\}. \tag{20.10}$$

A hypothesis $A_i$, referred to as a singleton, is the lowest level of discernible
information.


Definition 20.2 (Basic probability assignment function). The basic probability
assignment function m(A) of event (or proposition) A is defined as a
mapping $m: 2^{\Theta} \to [0,1]$ (over all events of $\Theta$) that satisfies $m(\emptyset) = 0$ and

$$\sum_{A \subseteq \Theta} m(A) = 1. \tag{20.11}$$

The power set of $\Theta$, $2^{\Theta}$, contains all subsets of $\Theta$. For example, if
$\Theta = \{A_1, A_2, A_3\}$, then $2^{\Theta} = \{\emptyset, A_1, A_2, A_3, A_1 \cup A_2, A_1 \cup A_3, A_2 \cup A_3, \Theta\}$.

With any event A in the frame of discernment $\Theta$, if the basic probability
assignment m(A) > 0, the event A is called a focal element of m. If m has a
single focal element A, it is said to be categorical and denoted as $m_A$. If all focal
elements of m are singletons, then m is said to be Bayesian.

Values of a basic probability assignment function are called belief masses. A
basic probability assignment function with $m(\emptyset) = 0$ is said to be normal.


Definition 20.3 (Belief function). With any hypothesis $A \subseteq \Theta$, its belief function
$\mathrm{Bel}(A): 2^{\Theta} \to [0,1]$ is defined as the sum of the corresponding basic
probabilities of all its subsets, namely

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \quad \forall A \subseteq \Theta. \tag{20.12}$$

The belief function, which is also called the lower limit function, represents the
minimal support for A and can be interpreted as a global measure of one’s belief
that hypothesis A is true. From the definition, we know $\mathrm{Bel}(\emptyset) = 0$ and $\mathrm{Bel}(\Theta) = 1$.
Note that the basic probability assignment and belief functions are in one-to-one
correspondence.


Definition 20.4 (Plausibility function). The plausibility function of
A, $\mathrm{Pl}(A): 2^{\Theta} \to [0,1]$, is the amount of belief not committed to the negation of
A

$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(\bar{A}) = \sum_{B \cap A \neq \emptyset} m(B), \quad \forall A \subseteq \Theta. \tag{20.13}$$

The plausibility function, which is also called the upper limit function, expresses
the greatest potential belief degree in A. The plausibility function is a possibility
measure, and the belief function is the dual necessity measure. It can be proved
that [62]

$$\mathrm{Pl}(A) \geq \mathrm{Bel}(A), \quad \forall A \subseteq \Theta. \tag{20.14}$$

If m is Bayesian, then function Bel is identical to Pl and it is a probability
measure.

Definition 20.5 (Uncertainty function). The uncertainty of A is defined by

$$U(A) = \mathrm{Pl}(A) - \mathrm{Bel}(A). \tag{20.15}$$


Definition 20.6 (Commonality function). The commonality function states
how much basic probability assignment is committed to A and all of its supersets

$$\mathrm{Com}(A) = \sum_{B \supseteq A} m(B). \tag{20.16}$$
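
The belief, plausibility, uncertainty and commonality functions can be computed directly from a basic probability assignment. In the sketch below, events are represented as `frozenset` objects and masses as exact fractions. The three-hypothesis frame and the mass values are made-up illustrations.

```python
from fractions import Fraction


def bel(m, a):
    """Belief: total mass of the non-empty subsets of a."""
    return sum(mass for b, mass in m.items() if b and b <= a)


def pl(m, a):
    """Plausibility: total mass of the events that intersect a."""
    return sum(mass for b, mass in m.items() if b & a)


def uncertainty(m, a):
    """Uncertainty: gap between plausibility and belief."""
    return pl(m, a) - bel(m, a)


def com(m, a):
    """Commonality: total mass of a and all of its supersets."""
    return sum(mass for b, mass in m.items() if b >= a)


# Hypothetical frame of discernment and basic probability assignment.
theta = frozenset({"A1", "A2", "A3"})
m = {
    frozenset({"A1"}): Fraction(1, 2),
    frozenset({"A1", "A2"}): Fraction(3, 10),
    theta: Fraction(1, 5),
}

if __name__ == "__main__":
    a = frozenset({"A1", "A2"})
    print(bel(m, a), pl(m, a), uncertainty(m, a), com(m, a))
    print(pl(m, a) == 1 - bel(m, theta - a))  # plausibility versus belief in the complement
```

For the event {A1, A2} the mass on the whole frame counts toward plausibility but not toward belief, which is exactly the uncertainty of that event.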

Belief functions are widely used formalisms for uncertainty representation and
processing. For a combination of beliefs, Dempster’s rule of combination is used
in Dempster-Shafer theory. Under strict probabilistic assumptions, its results are
probabilistically correct and interpretable. Dempster’s rule, also called the
orthogonal summing rule, produces a new belief (probability) represented by a basic
probability assignment output by synthesizing the basic probability assignments
from many sources. Assuming that $m_1$ and $m_2$ are basic probability assignment
values from two different information sources in the same frame of discernment,
we have [62]

$$m(A) = m_1 \oplus m_2 = \frac{1}{1-K} \sum_{B \cap C = A} m_1(B) m_2(C), \quad A \neq \emptyset, \tag{20.17}$$

where

$$K = \sum_{B \cap C = \emptyset} m_1(B) m_2(C) \geq 0 \tag{20.18}$$

measures the conflict between various evidence sources. If K = 1, the two pieces
of evidence are logically contradictory and they cannot be combined. In order to
apply Dempster’s rule in the presence of highly conflicting beliefs, all conflicting
belief masses can be allocated to a missing (empty) event based on the open-world
assumption. The open-world assumption states that some possible event
must have been overlooked and thus is missing in the frame of discernment [67].

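In general, within the same frame of discernment $\Theta$, the combining result of
n basic probability assignment values $m_1, m_2, \ldots, m_n$ is given by

$$m(A) = m_1 \oplus m_2 \oplus \cdots \oplus m_n = \frac{1}{1-K} \sum_{\cap A_i = A} \left( \prod_{i=1}^{n} m_i(A_i) \right), \tag{20.19}$$

$$K = \sum_{\cap A_i = \emptyset} \left( \prod_{i=1}^{n} m_i(A_i) \right). \tag{20.20}$$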
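
Dempster’s rule for two sources can be sketched as follows, assuming the common convention in which K is the total mass falling on empty intersections and the combined masses are rescaled by 1/(1 - K). The frame {a, b, c} and the mass values are made-up illustrations, and exact fractions are used to keep the arithmetic transparent.

```python
from fractions import Fraction
from itertools import product


def dempster_combine(m1, m2):
    """Combine two basic probability assignments; return (combined masses, conflict K)."""
    combined = {}
    conflict = Fraction(0)
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            combined[intersection] = combined.get(intersection, 0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # mass that would land on the empty set
    if conflict == 1:
        raise ValueError("totally conflicting evidence cannot be combined")
    scale = 1 - conflict
    return {a: mass / scale for a, mass in combined.items()}, conflict


# Hypothetical evidence from two sources over the frame {a, b, c}.
theta = frozenset("abc")
m1 = {frozenset("a"): Fraction(3, 5), theta: Fraction(2, 5)}
m2 = {frozenset("b"): Fraction(1, 2), theta: Fraction(1, 2)}

if __name__ == "__main__":
    m, k = dempster_combine(m1, m2)
    print(k)
    for event, mass in m.items():
        print(sorted(event), mass)
```

Here the only conflicting pair is {a} against {b}, which carries mass 3/10, so the remaining products are divided by 7/10. Since the rule is associative, more than two sources can be handled by combining them one at a time, for example with `functools.reduce` over the mass dictionaries.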

The advantage of the Dempster-Shafer method over the Bayesian method is that the
Dempster-Shafer method does not require prior probabilities; it combines current
evidence. The Dempster-Shafer method fails for fuzzy systems, since it
requires the hypotheses to be mutually exclusive.

Dempster’s rule assumes the classifiers to be independent. For combining non- 
independent classifiers, the cautious rule and, more generally, t-norm based rules 
with behavior ranging between Dempster’s rule and the cautious rule can be used 
[54]. An optimal combination scheme can be learned based on a parameterized 
family of t-norms. 

When implementing decision-making level fusion with the Dempster-Shafer
theory, the corresponding basic probability assignments must be set up.
To avoid the trouble of establishing basic probability assignments, an MLP
can be used as an aid. The independent diagnosis of each category of feature data
is first conducted using the MLP. The MLP outputs are normalized
and then taken as the basic probability assignments of the hypotheses, e.g., tremor
types in a tremor diagnosis task. A k-NN
rule based on Dempster-Shafer theory [14] can handle partially supervised data,
in which uncertain class labels are represented by belief functions.


20.1 An ordered weighted averaging operator of dimension n is a mapping
OWA: $R^n \to R$ that has an associated weighting vector $\boldsymbol{w}$ of dimension n with
$\sum_{j=1}^{n} w_j = 1$ and $w_j \in [0,1]$, such that

$$\mathrm{OWA}(a_1, a_2, \ldots, a_n) = \sum_{j=1}^{n} w_j b_j,$$

where $b_j$ is the jth largest of the $a_i$.

(a) Show that it has the maximum, the minimum and the average as special
cases.

(b) Show that the operator is commutative, monotonic, bounded, and idempotent.

(c) Show that the median operator can be expressed as an OWA function
with the weighting vector $\boldsymbol{w} = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ for odd n and
$\boldsymbol{w} = (0, \ldots, 0, \frac{1}{2}, \frac{1}{2}, 0, \ldots, 0)^T$ for even n.


20.2 Consider the data set of two-dimensional patterns: (1,1,1), (1,2, 1), 
(2,1,1), (2,2,1), (4,3,2), (4,2,2), (5,2,2), (5,3,2), (4,4,3), (5,4,3), (5,5,3), 
(4,5,3), where each pattern is represented by two features and the class label. 
Implement bagging on the data set and classify the test pattern (3.5, 2.5) by the 
following steps: 

(a) Select two patterns from each class at random and use the 1-NN algorithm 
to classify the test pattern. 

(b) Perform this procedure five times and classify the test pattern according to 
majority voting. 


20.3 In the cascading method, it is required that $\theta_{j+1} \geq \theta_j$. Give an explanation
as to why this is required.


20.4 Consider the data set of patterns: (1,1,1), (1,2,1), (2,1,1), (2,2,1), 
(3.5, 2,2), (4,1.5,2), (4,2, 2), (5, 1,2), where each pattern is represented by two 
features and the class label. Classify a pattern with two features (3.1, 1) by using 
AdaBoost with the following weak classifiers: 

(a) If $x_1 < 2$ then the pattern belongs to class 1; else to class 2.

(b) If $x_1 < 3$ then the pattern belongs to class 1; else to class 2.

(c) If $x_1 + x_2 < 4$ then the pattern belongs to class 1; else to class 2.


20.5 Write a program implementing AdaBoost. 


20.6 Show that in the case of AdaBoost, for a given performance level of a strong
learner, the more discriminative each weak learner is, the fewer
weak learners are needed and the shorter the training time.


Introduction of fuzzy sets and logic 


Introduction 


In many soft sciences (e.g. psychology, sociology, ethology), scientists provide 
verbal descriptions and explanations of various phenomena based on observa- 
tions. Fuzzy logic provides the most suitable tool for verbal computation. It is 
a paradigm for modeling the uncertainty in human reasoning, and is a basic 
tool for machine learning and expert systems. Fuzzy logic has also been widely 
applied in control, data analysis, regression, and signal and image processing. 

The concept of fuzzy sets was first proposed by Zadeh [44]. The theory of 
fuzzy sets and logic, as a mathematical extension to classical theory of sets 
and binary logic, has become a general mathematical tool for data analysis. 
Fuzzy sets serve as information granules quantifying a given input or output 
variable, and fuzzy logic is a means of knowledge representation. In fuzzy logic, 
the knowledge of experts is modeled by linguistic IF-THEN rules, which build up 
a fuzzy inference system. Some fuzzy inference systems have universal function 
approximation capability; they can be used in many areas where neural networks 
are applicable. An exact model is not needed for model design. 

Knowledge-based systems represent a different perspective on the human 
brain’s epigenetic process. Unlike neural networks, there is no attempt to model 
the physical structure; an inference engine fulfils that role. Learning is modelled 
by the construction of rules that are produced under the guidance of domain 
experts and held in a knowledge base. The abstract process of reasoning occurs 
when the inference engine fires rules as a result of data input. This explicit knowl- 
edge representation and reasoning offer the advantage that knowledge can be 
updated dynamically. A knowledge-based system can provide a rationale behind 
its decisions. 

Fuzzy logic uses the notion of membership. It is most suitable for the rep- 
resentation of vague data and concepts on an intuitive basis, such as human 
linguistic description, e.g. the expressions approximately, good, strong. The con- 
ventional or crisp set can be treated as a special case of a fuzzy set. A fuzzy 
set is uniquely determined by its membership function, and it is also associated 
with a linguistically meaningful term. 

Fuzzy logic provides a systematic framework to incorporate human experience. 
It is based on three core concepts, namely fuzzy sets, linguistic variables 
and possibility distributions. A fuzzy set is an effective means to represent 
linguistic variables. A linguistic variable is a variable whose value can be described 
qualitatively using a linguistic expression and quantitatively using a membership 
function [45]. Linguistic expressions are useful for communicating concepts and 
knowledge with human beings, whereas membership functions are useful for pro- 
cessing numeric input data. When a fuzzy set is assigned to a linguistic variable, 
it imposes an elastic constraint, called a possibility distribution, on the possible 
values of the variable. 

Fuzzy logic is a rigorous mathematical discipline. Fuzzy reasoning is a straight- 
forward formalism for encoding human knowledge or common sense in a numer- 
ical framework, and fuzzy inference systems can approximate arbitrarily well 
any continuous function on a compact domain [17, 41]. Fuzzy inference systems 
and feedforward networks can approximate each other to any degree of accu- 
racy [5]. In [15], the Mamdani model and feedforward networks are shown to be 
able to approximate each other to an arbitrary accuracy. Gaussian-based Mam- 
dani systems have the ability of approximating any sufficiently smooth function 
and reproducing its derivatives up to any order [10]. The functional equivalence 
between a multilayer feedforward network and a zero-order Takagi-Sugeno-Kang 
(TSK) fuzzy system is proven in [21]. The TSK model is proved to be equivalent 
to the RBF network under certain conditions [13]. Fuzzy systems with Gaussian 
membership functions are proved to be universal approximators for a smooth 
function and its derivatives [19]. In [42], the fuzzy system with nth-order 
B-spline membership functions and the CMAC network with nth-order B-spline 
basis functions are proved to be universal approximators for a smooth function 
and its derivatives up to the (n − 2)th order. 


21.2 Definitions and terminologies 


In this section, we give some definitions and terminologies used in the fuzzy logic 
literature. 


Definition 21.1 (Universe of discourse). The universal set $\mathcal{X}: \mathcal{X} \to [0,1]$ 
is called the universe of discourse, or simply the universe. The implication 
$\mathcal{X} \to [0,1]$ is the abbreviation for the IF-THEN rule: “IF $x$ is in $\mathcal{X}$, THEN 
its membership function $\mu_{\mathcal{X}}(x)$ is in $[0,1]$.” 


The universe $\mathcal{X}$ may contain either discrete or continuous values. 


Definition 21.2 (Linguistic variable). A linguistic variable is a variable 
whose values are linguistic terms in a natural or artificial language. 


For example, the size of an object is a linguistic variable, whose value can be 
small, medium and large. 


Definition 21.3 (Fuzzy set). A fuzzy set $A$ in $\mathcal{X}$ is defined by 

$$A = \{ (x, \mu_A(x)) \mid x \in \mathcal{X} \}, \tag{21.1}$$

where $\mu_A(x) \in [0,1]$ is the membership function of $x$ in $A$. For $\mu_A(x)$, the value 
1 stands for complete membership of the set $A$, while 0 represents that $x$ does 
not belong to the set at all. 


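A fuzzy set can also be represented by 

$$A = \begin{cases} \sum_{x_i \in \mathcal{X}} \dfrac{\mu_A(x_i)}{x_i}, & \text{if } \mathcal{X} \text{ is discrete} \\[2ex] \int_{\mathcal{X}} \dfrac{\mu_A(x)}{x}, & \text{if } \mathcal{X} \text{ is continuous} \end{cases} \tag{21.2}$$

The summation, integral and division signs syntactically denote the union of 
$(x, \mu_A(x))$ pairs. 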


Definition 21.4 (Support). The elements of a fuzzy set $A$ whose membership 
is larger than zero are called the support of the fuzzy set 

$$\mathrm{supp}(A) = \{ x \in A \mid \mu_A(x) > 0 \}. \tag{21.3}$$


Definition 21.5 (Height). The height of a fuzzy set $A$ is defined by 

$$h(A) = \sup \{ \mu_A(x) \mid x \in \mathcal{X} \}. \tag{21.4}$$


Definition 21.6 (Normal fuzzy set). If h(A) = 1, then a fuzzy set A is said 
to be normal. 


Definition 21.7 (Non-normal fuzzy set). If $0 < h(A) < 1$, a fuzzy set $A$ is 
said to be non-normal. It can be normalized by dividing it by its height 

$$\bar{\mu}_A(x) = \frac{\mu_A(x)}{h(A)}. \tag{21.5}$$

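For a finite universe, support, height and normalization reduce to a few dictionary operations. The following is a minimal sketch, assuming a fuzzy set is stored as a hypothetical mapping from elements to membership grades; the element names and grades are illustrative only.

```python
# Hypothetical discrete fuzzy set: element -> membership grade in [0, 1].
A = {"x1": 0.0, "x2": 0.2, "x3": 0.8, "x4": 0.5}

def support(fs):
    """Elements whose membership is strictly greater than zero, cf. (21.3)."""
    return {x for x, mu in fs.items() if mu > 0}

def height(fs):
    """Largest membership grade; max replaces sup on a finite universe, cf. (21.4)."""
    return max(fs.values())

def normalize(fs):
    """Divide every grade by the height, cf. (21.5); needs a positive height."""
    h = height(fs)
    if h <= 0:
        raise ValueError("cannot normalize a fuzzy set of height 0")
    return {x: mu / h for x, mu in fs.items()}

A_norm = normalize(A)
print(sorted(support(A)), height(A), height(A_norm))
```

After normalization the height is 1, so the normalized set is normal in the sense of Definition 21.6.
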
Definition 21.8 (Fuzzy partition). For a linguistic variable, a number of 
fuzzy subsets are enumerated as the value of the variable. This collection of fuzzy 
subsets is called a fuzzy partition. Each fuzzy subset has a membership function. 
For a finite fuzzy partition $\{A_1, A_2, \ldots, A_n\}$ of a set $A$, the membership functions 
satisfy, for each $x \in A$, 

$$\sum_{i=1}^{n} \mu_{A_i}(x) = 1 \tag{21.6}$$

and each $A_i$ is normal, that is, the height of $A_i$ is unity. 

A fuzzy partition is illustrated in Fig. 21.1. The fuzzy set for representing 
the linguistic variable human age is partitioned into three fuzzy subsets, namely 
young, middle age and old. Each fuzzy subset is characterized by a membership 
function. 


Figure 21.1 A fuzzy partition of human age. (Fuzzy subsets: Young, Middle age, Old; horizontal axis x with ticks at 0, 20, 40 and 60.) 
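A partition of this kind can be checked numerically. The sketch below assumes piecewise-linear membership functions with breakpoints at 20, 40 and 60 years; these shapes are an assumption suggested by the axis ticks of Fig. 21.1, not values stated in the text.

```python
import numpy as np

# Assumed breakpoints (years). np.interp holds the end values outside the range.
def young(x):
    return np.interp(x, [20.0, 40.0], [1.0, 0.0])             # 1 up to 20, 0 from 40

def middle_age(x):
    return np.interp(x, [20.0, 40.0, 60.0], [0.0, 1.0, 0.0])  # peak at 40

def old(x):
    return np.interp(x, [40.0, 60.0], [0.0, 1.0])             # 0 up to 40, 1 from 60

ages = np.linspace(0.0, 80.0, 161)
total = young(ages) + middle_age(ages) + old(ages)

# Condition (21.6): the memberships sum to one at every age.
print(np.allclose(total, 1.0))
```

Each of the three subsets reaches grade 1 somewhere, so the normality requirement of Definition 21.8 also holds for this choice.
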


Definition 21.9 (Empty set). The subset of $\mathcal{X}$ having no element is called 
the empty set, denoted by $\emptyset$. 


Definition 21.10 (Complement). The complement of $A$, written $\bar{A}$, $\neg A$ or 
NOT $A$, is defined as $\mu_{\bar{A}}(x) = 1 - \mu_A(x)$. Thus, $\bar{\mathcal{X}} = \emptyset$ and $\bar{\emptyset} = \mathcal{X}$. 


Definition 21.11 (α-cut). The α-cut or α-level set of a fuzzy set $A$, written 
$\mu_A[\alpha]$, is defined as the set of all elements in $A$ whose degree of membership is 
not less than $\alpha$ 

$$\mu_A[\alpha] = \{ x \in A \mid \mu_A(x) \ge \alpha \}, \quad \alpha \in [0,1]. \tag{21.7}$$


A fuzzy set $A$ is usually represented by its membership function $\mu(x)$, $x \in A$. 
The inverse of $\mu(x)$ can be represented by $x = \mu^{-1}(\alpha)$, $\alpha \in [0,1]$, where each 
value of $\alpha$ may correspond to one or more values of $x$. A fuzzy set is usually 
represented by a finite number of its membership values. 

The resolution principle uses α-cuts to represent membership in a fuzzy set 

$$\mu_A = \bigvee_{0 < \alpha \le 1} \left[ \alpha \, \mu_A[\alpha] \right], \tag{21.8}$$

where the maximum is taken over all values of $\alpha$. 


Definition 21.12 (Kernel or core). All the elements in a fuzzy set $A$ with 
membership degree 1 constitute a subset called the kernel or core of the fuzzy set, 
written $\ker(A) = \mu_A[1]$. 


The support, kernel and α-cut of a fuzzy set are shown in Fig. 21.2 for a 
trapezoid membership function, where $a$, $b$, $c$ and $d$ are shape parameters. The 
α-cut shown is represented by $\mu_A[\alpha] = [a_1, a_2]$. 
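On a finite universe, the α-cut, the kernel and the resolution principle can be sketched as follows. The fuzzy set `A` is a hypothetical example, and restricting α to the distinct positive grades of `A` is an implementation shortcut that is valid only because the universe is finite.

```python
# Hypothetical discrete fuzzy set on the universe {1, ..., 5}.
A = {1: 0.2, 2: 0.7, 3: 1.0, 4: 0.7, 5: 0.0}

def alpha_cut(fs, alpha):
    """Elements whose membership is not less than alpha, cf. (21.7)."""
    return {x for x, mu in fs.items() if mu >= alpha}

def kernel(fs):
    """The kernel (core) is the alpha-cut at alpha = 1."""
    return alpha_cut(fs, 1.0)

def rebuild(fs):
    """Resolution principle: mu(x) is the largest alpha whose cut contains x."""
    levels = sorted({mu for mu in fs.values() if mu > 0})
    return {
        x: max((a for a in levels if x in alpha_cut(fs, a)), default=0.0)
        for x in fs
    }

print(alpha_cut(A, 0.7), kernel(A), rebuild(A) == A)
```
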


Definition 21.13 (Convex fuzzy set). A fuzzy set $A$ is said to be convex if 
and only if 

$$\mu_A(\lambda x_1 + (1-\lambda) x_2) \ge \mu_A(x_1) \wedge \mu_A(x_2), \quad \forall \lambda \in [0,1],\ x_1, x_2 \in \mathcal{X}, \tag{21.9}$$

where $\wedge$ denotes the minimum operation. 

Any α-cut set of a convex fuzzy set is a closed interval. 


Figure 21.2 Support, kernel and α-cut of a fuzzy set A. 

Figure 21.3 Representations of a fuzzy number. (a) α-level sets. (b) Discretized membership function. 


Definition 21.14 (Concave fuzzy set). A fuzzy set $A$ is said to be concave 
if and only if 

$$\mu_A(\lambda x_1 + (1-\lambda) x_2) \le \mu_A(x_1) \vee \mu_A(x_2), \quad \forall \lambda \in [0,1],\ x_1, x_2 \in \mathcal{X}, \tag{21.10}$$

where $\vee$ denotes the maximum operation. 


Definition 21.15 (Fuzzy number). A fuzzy number $A$ is a fuzzy set of the 
real line with a normal, convex and continuous membership function of bounded 
support. 


Fuzzy numbers are fuzzified versions of classical crisp intervals. Thus, the 
theory of fuzzy numbers and their arithmetic should be a fuzzified version of 
interval analysis. A fuzzy number is usually represented by a family of α-level 
sets or by a discretized membership function, as illustrated in Fig. 21.3. 


Definition 21.16 (Fuzzy singleton). A fuzzy set $A = \{ (x, \mu_A(x)) \mid x \in \mathcal{X} \}$ is 
said to be a fuzzy singleton if $\mu_A(x) = 1$ for some $x \in \mathcal{X}$ and $\mu_A(x') = 0$ for all 
$x' \in \mathcal{X}$ with $x' \ne x$. 


Definition 21.17 (Cardinality). Given a fuzzy set $A$ defined in a finite or 
countable universe $\mathcal{X}$, its cardinality, denoted card($A$), is defined by 

$$\mathrm{card}(A) = |A| = \sum_{x \in \mathcal{X}} \mu_A(x). \tag{21.11}$$


The cardinality of a fuzzy set is derived by summing up the membership 
degrees. Cardinality is used to measure the magnitude of a fuzzy set. It is associated 
with the concept of granularity of information granules. Relative cardinality 
is obtained by dividing the magnitude of fuzzy set $A$ by that of the universal set $\mathcal{X}$ 

$$\|A\| = \frac{|A|}{|\mathcal{X}|}. \tag{21.12}$$
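As a small illustration of cardinality and relative cardinality, the sketch below assumes a hypothetical four-element universe in which every element has grade 1 in the universal set, so that the magnitude of the universe is simply the number of elements.

```python
# Hypothetical fuzzy set over a four-element universe.
A = {"a": 0.1, "b": 0.5, "c": 1.0, "d": 0.4}

def card(fs):
    """Sum of the membership grades, cf. (21.11)."""
    return sum(fs.values())

def relative_card(fs):
    """card(A) divided by the size of the universe, cf. (21.12)."""
    return card(fs) / len(fs)

print(card(A), relative_card(A))
```
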


Definition 21.18 (Equality). Fuzzy sets $A$ and $B$ defined in the same universe 
$\mathcal{X}$ are said to be equal, $A = B$, if and only if 

$$\mu_A(x) = \mu_B(x), \quad \forall x \in \mathcal{X}. \tag{21.13}$$


Definition 21.19 (Fuzzy subset and inclusion). A fuzzy set $A = 
\{ (x, \mu_A(x)) \mid x \in \mathcal{X} \}$ is said to be a fuzzy subset of $B = \{ (x, \mu_B(x)) \mid x \in \mathcal{X} \}$, 
denoted $A \subseteq B$, where $\subseteq$ is the inclusion operator, if and only if every element 
of $A$ is also an element of $B$, that is, 

$$\mu_A(x) \le \mu_B(x), \quad \forall x \in \mathcal{X}. \tag{21.14}$$


Definition 21.20 (Product of fuzzy sets). The product of fuzzy sets $A$ and 
$B$, defined on the same universe of discourse $\mathcal{X}$, denoted $A \cdot B$, is also a fuzzy 
set, whose membership function is given by 

$$\mu_{A \cdot B}(x) = \mu_A(x) \mu_B(x). \tag{21.15}$$


Definition 21.21 (Hedge). A hedge transforms a fuzzy set into a new fuzzy 
set. Hedges are modifiers, adjectives or adverbs, which change truth values. 


Hedges are used to intensify or dilute the characteristic of a fuzzy set such as 
very and quite, or to approximate a fuzzy set or convert a scalar to a fuzzy set 
such as about, nearly, roughly. The use of hedges enables dynamical creation of 
fuzzy sets and this also helps to reduce the complexity of rules. For example, for a 
fuzzy set good with membership degree $\mu_A(x)$, very good can be described using 
membership degree $\mu_A^2(x)$, while quite good can be described using membership 
degree $\mu_A^{1/2}(x)$. An illustration of hedge operations is given in Fig. 21.4. The 
membership function $\mu(x)$ is a hedge operation that transforms a scalar 5 into a 
fuzzy number close to 5, and $\mu^2(x)$ and $\mu^{1/2}(x)$ are, respectively, 
hedge operations that realize very close to 5 and quite close to 5. 


Figure 21.4 An illustration of hedge operations. (Legend entries: very close, quite close.) 


Definition 21.22 (Power of a fuzzy set). The power of a fuzzy set $A$ is a 
new fuzzy set, $A^\alpha$, whose membership function is given by 

$$\mu_{A^\alpha}(x) = [\mu_A(x)]^\alpha. \tag{21.16}$$


This actually defines the hedge functions. Concentration raises a fuzzy set $A$ 
to power 2 and obtains a fuzzy set $A^2$. Dilation of $A$ generates a fuzzy set $A^{1/2}$. 
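Concentration and dilation are easy to reproduce numerically. In the sketch below, the Gaussian membership function for "close to 5" and its width are assumptions chosen for illustration; only the power operation itself comes from (21.16).

```python
import numpy as np

def close_to_5(x, sigma=1.0):
    # Hypothetical Gaussian membership for the fuzzy number "close to 5".
    return np.exp(-((x - 5.0) ** 2) / (2.0 * sigma ** 2))

def power(mu, alpha):
    """Membership of the power set A^alpha, cf. (21.16)."""
    return mu ** alpha

x = np.linspace(0.0, 10.0, 101)
mu = close_to_5(x)
very = power(mu, 2.0)    # concentration: "very close to 5"
quite = power(mu, 0.5)   # dilation: "quite close to 5"

# Grades lie in [0, 1], so squaring lowers them and the square root raises them.
print(bool(np.all(very <= mu)), bool(np.all(mu <= quite)))
```
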

In traditional fuzzy systems, the structure is characterized by using type-1 
fuzzy sets. A type-1 fuzzy set, defined on a universe of discourse, maps each element 
of the universe of discourse onto a precise number in [0,1]. A type-2 fuzzy set can 
be informally defined as a fuzzy set that is characterized by a fuzzy membership 
function at the computational level. More generally, 


Definition 21.23 (Type-n fuzzy set). A type-n fuzzy set is a fuzzy set whose 
membership values are type-(n − 1), n > 1, fuzzy sets on [0,1]. 


Type-2 fuzzy logic has a higher computational complexity than type-1 fuzzy logic. 
However, it demonstrates improved performance and robustness relative to type-1 fuzzy logic 
when confronted with various sources of data uncertainties [16]. 

The representations of α-planes [23] and zSlices [40] offer a viable framework 
for representing and computing with the general type-2 fuzzy sets. The α-planes 
and the zSlices representation theorems allow us to treat the general type-2 fuzzy 
sets as a composition of multiple interval type-2 fuzzy sets, each raised to the 
respective level of either α or z. 


Definition 21.24 (Fuzzy transform and inverse fuzzy transform). A 
direct fuzzy transform [31] uses a fuzzy partition of an interval [a,b] to convert 
a continuous function defined on [a,b] into a suitable n-dimensional vector. 
Then an inverse fuzzy transform converts this n-dimensional vector into another 
continuous function which approximates the original function up to an arbitrary 
accuracy. 


Fuzzy transform explains modeling with fuzzy IF-THEN rules as a specific 
transformation. Fuzzy transform can be regarded as a specific type of Takagi- 
Sugeno fuzzy system, with several additional properties [2]. 


21.3 Membership function 


A fuzzy set $A$ over the universe of discourse $\mathcal{X}$, $A \subseteq \mathcal{X} \to [0,1]$, is described 
by the degree of membership $\mu_A(x) \in [0,1]$ for each $x \in \mathcal{X}$. Unimodality and 
normality are two important aspects of the membership functions. 

Piecewise-linear functions such as triangles and trapezoids are often used as 
membership functions in applications. The triangular membership function can 
be defined by 

$$\mu(x; a, b, c) = \begin{cases} \dfrac{x-a}{b-a}, & a \le x \le b \\[1.5ex] \dfrac{c-x}{c-b}, & b < x \le c \\[1.5ex] 0, & \text{otherwise} \end{cases} \tag{21.17}$$

where the shape parameters satisfy $a < b < c$ and $b \in \mathcal{X}$. The triangular membership 
function is useful for modeling linguistic terms such as “The value is 
close to 10”. 

The trapezoid membership function can be defined by 

$$\mu(x; a, b, c, d) = \begin{cases} 0, & x \le a \text{ or } x \ge d \\[1ex] \dfrac{x-a}{b-a}, & a < x < b \\[1.5ex] 1, & b \le x \le c \\[1ex] \dfrac{d-x}{d-c}, & c < x < d \end{cases} \tag{21.18}$$

where the shape parameters satisfy $a < b < c < d$. This function is shown in 
Fig. 21.2. It is suitable for modeling such linguistic terms as “He is in his twenties”. 
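The two piecewise-linear membership functions translate directly into code. The following is a minimal scalar sketch; the parameter values in the usage lines are illustrative, with the trapezoid using the values quoted for panel (a) of Fig. 21.5.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a < b < c)."""
    if a < x <= b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

def trapezoid(x, a, b, c, d):
    """Trapezoid membership with support (a, d) and kernel [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

# "The value is close to 10", with an assumed spread of 2 on either side.
print(triangular(9.0, 8.0, 10.0, 12.0))
# Trapezoid with a = 2, b = 4, c = 6, d = 7.
print(trapezoid(3.0, 2.0, 4.0, 6.0, 7.0), trapezoid(5.0, 2.0, 4.0, 6.0, 7.0))
```
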

The Gaussian and bell-shaped functions have continuous derivatives, and are 
usually used to replace the triangular membership function when shape parameters 
are adapted using a gradient-descent procedure. The Gaussian function is 
given by 

$$\mu(x; c, \sigma) = e^{-\frac{(x-c)^2}{2\sigma^2}}, \tag{21.19}$$

and the bell-shaped function is defined by 

$$\mu(x; c, a, b) = \frac{1}{1 + \left| \frac{x-c}{a} \right|^{2b}}. \tag{21.20}$$


Figure 21.5 Shapes of some popular membership functions (panels: Trapezoid, Gauss, Bell-shaped, S-shaped, Z-shaped, π-shaped; each plots μ against x). The parameters for each membership 
function are selected as: (a) Trapezoid, a = 2, b = 4, c = 6, d = 7. (b) Gaussian, σ = 1, c = 4. (c) 
Bell-shaped, a = 1, b = 3, c = 5. (d) S-shaped, β = 1.5, c = 5. (e) Z-shaped, β = −1.5, c = 5. (f) 
π-shaped, β1 = 6, c1 = 3, β2 = −6, c2 = 7. 

In (21.19) and (21.20), $c$ is the center of the curves, and $a$, $b$ and $\sigma$ are their 
shape parameters. 

Another popular membership function is the sigmoidal function of the form 

$$\mu(x; c, \beta) = \frac{1}{1 + e^{-\beta(x-c)}}, \tag{21.21}$$

where $c$ shifts the function to the left or to the right, and $\beta$ controls the shape 
of the function. When $\beta > 1$ it is an S-shaped function, and when $\beta < -1$ it is 
a Z-shaped function. 

When an S-shaped function is multiplied by a Z-shaped function, a π-shaped 
function is obtained: 

$$\mu(x; c_1, \beta_1, c_2, \beta_2) = \frac{1}{1 + e^{-\beta_1(x-c_1)}} \cdot \frac{1}{1 + e^{-\beta_2(x-c_2)}}, \tag{21.22}$$

where $\beta_1 > 1$, $\beta_2 < -1$, and $c_1 < c_2$. π-shaped membership functions can be used 
in situations where trapezoid membership functions are used. These popular 
membership functions are illustrated in Fig. 21.5. 
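The smooth membership functions can be sketched with NumPy as follows. The Gaussian, sigmoidal and π-shaped forms follow the descriptions above; the bell-shaped form used here is the common generalized bell function, which should be read as an assumption, and all parameter values in the usage lines are illustrative.

```python
import numpy as np

def gaussian(x, c, sigma):
    """Gaussian membership centered at c with width sigma."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def bell(x, c, a, b):
    """Generalized bell membership (assumed form): 1 / (1 + |(x - c)/a|^(2b))."""
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def sigmoid(x, c, beta):
    """S-shaped for positive beta, Z-shaped for negative beta."""
    return 1.0 / (1.0 + np.exp(-beta * (x - c)))

def pi_shaped(x, c1, beta1, c2, beta2):
    """Product of an S-shaped and a Z-shaped sigmoid (beta1 > 0 > beta2, c1 < c2)."""
    return sigmoid(x, c1, beta1) * sigmoid(x, c2, beta2)

x = np.linspace(0.0, 10.0, 11)
print(np.round(gaussian(x, 4.0, 1.0), 3))
print(np.round(pi_shaped(x, 3.0, 6.0, 7.0, -6.0), 3))
```

All four functions are differentiable in their shape parameters, which is what makes them convenient for gradient-based tuning.
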


21.4 Intersection, union and negation 

The set operations intersection and union correspond to the logic operations 
conjunction (AND) and disjunction (OR), respectively. Intersection is described 
by the so-called triangular norm (t-norm), denoted by $T(x, y)$, whereas union is 
described by the so-called triangular conorm (t-conorm), denoted by $C(x, y)$. 

If $A$ and $B$ are fuzzy subsets of $\mathcal{X}$, then intersection $I = A \cap B$ is defined by 

$$\mu_I(x) = T(\mu_A(x), \mu_B(x)). \tag{21.23}$$


Definition 21.25 (t-norm). A mapping $T: [0,1] \times [0,1] \to [0,1]$ with the 
following four properties is called a t-norm. For all $x, y, z \in [0,1]$, 

• Commutativity: $T(x, y) = T(y, x)$. 
• Monotonicity: $T(x, y) \le T(x, z)$, if $y \le z$. 
• Associativity: $T(x, T(y, z)) = T(T(x, y), z)$. 
• Linearity: $T(x, 1) = x$. 


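Basic t-norms are listed below [6]: 

$$T_m(x, y) = \min(x, y) \quad \text{(standard intersection)}, \tag{21.24}$$
$$T_b(x, y) = \max(0, x + y - 1) \quad \text{(bounded difference)}, \tag{21.25}$$
$$T_p(x, y) = xy \quad \text{(algebraic product)}, \tag{21.26}$$
$$T^*(x, y) = \begin{cases} x, & \text{if } y = 1 \\ y, & \text{if } x = 1 \\ 0, & \text{otherwise} \end{cases} \quad \text{(drastic intersection)}. \tag{21.27}$$

These t-norms satisfy the relation 

$$T^*(x, y) \le T_b(x, y) \le T_p(x, y) \le T_m(x, y), \quad \forall x, y \in [0,1], \tag{21.28}$$

where $T^*(x, y)$ and $T_m(x, y)$ are, respectively, the lower and upper bounds of any 
t-norm. 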
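The ordering of the four basic t-norms can be verified on a grid. This is a sketch only: the function names are hypothetical, and a grid check illustrates the inequality rather than proving it.

```python
import itertools

def t_min(x, y):        # standard intersection
    return min(x, y)

def t_bounded(x, y):    # max(0, x + y - 1)
    return max(0.0, x + y - 1.0)

def t_product(x, y):    # algebraic product
    return x * y

def t_drastic(x, y):    # drastic intersection
    if y == 1.0:
        return x
    if x == 1.0:
        return y
    return 0.0

grid = [i / 20 for i in range(21)]
tol = 1e-12  # absorbs floating-point rounding in x + y - 1

ordered = all(
    t_drastic(x, y) <= t_bounded(x, y) + tol
    and t_bounded(x, y) <= t_product(x, y) + tol
    and t_product(x, y) <= t_min(x, y) + tol
    for x, y in itertools.product(grid, grid)
)
boundary = all(abs(t(x, 1.0) - x) <= tol
               for t in (t_min, t_bounded, t_product, t_drastic) for x in grid)
print(ordered, boundary)
```
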

Similarly, union $U = A \cup B$ is defined by 

$$\mu_U(x) = C(\mu_A(x), \mu_B(x)). \tag{21.29}$$


Definition 21.26 (t-conorm). A mapping $C: [0,1] \times [0,1] \to [0,1]$ having the 
following four properties is called a t-conorm. For all $x, y, z \in [0,1]$, 

• Commutativity: $C(x, y) = C(y, x)$. 
• Monotonicity: $C(x, y) \le C(x, z)$, if $y \le z$. 
• Associativity: $C(x, C(y, z)) = C(C(x, y), z)$. 
• Linearity: $C(x, 0) = x$. 


Basic t-conorms are defined by [6] 

$$C_m(x, y) = \max(x, y) \quad \text{(standard union)}, \tag{21.30}$$
$$C_b(x, y) = \min(1, x + y) \quad \text{(bounded sum)}, \tag{21.31}$$
$$C_p(x, y) = x + y - xy \quad \text{(algebraic sum)}, \tag{21.32}$$
$$C^*(x, y) = \begin{cases} x, & \text{if } y = 0 \\ y, & \text{if } x = 0 \\ 1, & \text{otherwise} \end{cases} \quad \text{(drastic union)}. \tag{21.33}$$

Accordingly, 

$$C_m(x, y) \le C_p(x, y) \le C_b(x, y) \le C^*(x, y), \quad \forall x, y \in [0,1], \tag{21.34}$$

where $C_m(x, y)$ and $C^*(x, y)$ are, respectively, the lower and upper bounds of 
any t-conorm. 

When a t-norm and a t-conorm satisfy 

$$1 - T(x, y) = C(1 - x, 1 - y), \tag{21.35}$$

$T$ and $C$ are said to be dual. This makes De Morgan’s laws $\overline{A \cap B} = \bar{A} \cup \bar{B}$ and 
$\overline{A \cup B} = \bar{A} \cap \bar{B}$ still hold in fuzzy set theory. The above basic t-norms and 
t-conorms with the same subscripts are dual. To satisfy the principle of duality, 
they are usually used in pairs. 
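Duality between the paired t-norms and t-conorms can be checked numerically. The sketch below is self-contained and uses hypothetical names; the grid of quarters is chosen because those values are exactly representable, which keeps the equality tests in the drastic operators safe.

```python
import itertools

def t_drastic(x, y):
    return x if y == 1.0 else (y if x == 1.0 else 0.0)

def c_drastic(x, y):
    return x if y == 0.0 else (y if x == 0.0 else 1.0)

# (t-norm, t-conorm) pairs with matching names.
pairs = {
    "standard": (min, max),
    "bounded": (lambda x, y: max(0.0, x + y - 1.0), lambda x, y: min(1.0, x + y)),
    "algebraic": (lambda x, y: x * y, lambda x, y: x + y - x * y),
    "drastic": (t_drastic, c_drastic),
}

grid = [i / 4 for i in range(5)]  # 0, 0.25, 0.5, 0.75, 1

def is_dual(t_norm, t_conorm, tol=1e-12):
    """Check 1 - T(x, y) == C(1 - x, 1 - y) on the grid."""
    return all(
        abs((1.0 - t_norm(x, y)) - t_conorm(1.0 - x, 1.0 - y)) <= tol
        for x, y in itertools.product(grid, grid)
    )

duality = {name: is_dual(t, c) for name, (t, c) in pairs.items()}
print(duality)
# A mismatched pair, for contrast: minimum with the algebraic sum.
print(is_dual(min, pairs["algebraic"][1]))
```
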


Definition 21.27 (Negation). A function $N: [0,1] \to [0,1]$ is called a (fuzzy) 
negation ($\neg$) if it is monotonically nonincreasing, continuous, $N(0) = 1$ and 
$N(1) = 0$. A negation $N$ is said to be strict if it is strictly decreasing and continuous, 
or strong if it is an involution, that is, $N(N(x)) = x$, $\forall x \in [0,1]$. 


Negation generalizes the set notion of complement. 


21.5 Fuzzy relation and aggregation 


Definition 21.28 (Extension principle). Given a mapping $f: \mathcal{X} \to \mathcal{Y}$, if we 
have a fuzzy set $A = \{ (x, \mu_A(x)) \mid x \in \mathcal{X} \}$, $\mu_A(x) \in [0,1]$, the extension principle 
is defined by 

$$f(A) = f(\{ (x, \mu_A(x)) \mid x \in \mathcal{X} \}) = \{ (f(x), \mu_A(x)) \mid x \in \mathcal{X} \}. \tag{21.36}$$


Application of the extension principle transforms $x$ into $f(x)$, but does not 
affect the membership function $\mu_A(x)$. 
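A discrete sketch of the extension principle is given below. One point is an assumption beyond the definition above: when $f$ maps several elements to the same image, the sketch keeps the largest grade, which is the usual convention for non-injective mappings.

```python
def extend(f, fs):
    """Map each element through f and carry its membership grade along."""
    image = {}
    for x, mu in fs.items():
        y = f(x)
        # Assumed convention: on collisions keep the largest grade.
        image[y] = max(mu, image.get(y, 0.0))
    return image

# Hypothetical fuzzy set "about zero" on the integers -2..2.
A = {-2: 0.3, -1: 0.8, 0: 1.0, 1: 0.6, 2: 0.1}
print(extend(lambda x: x * x, A))
```

For an injective mapping no collision occurs, and each grade is simply attached to $f(x)$.
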


Definition 21.29 (Cartesian product). If $\mathcal{X}$ and $\mathcal{Y}$ are two universal sets, 
then $\mathcal{X} \times \mathcal{Y}$ is the set of all ordered pairs $\{ (x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y} \}$. Let $A$ be a fuzzy 
set of $\mathcal{X}$ and $B$ a fuzzy set of $\mathcal{Y}$. The Cartesian product is defined by 

$$A \times B = \{ (z, \mu_{A \times B}(z)) \mid z = (x, y) \in \mathcal{Z},\ \mathcal{Z} = \mathcal{X} \times \mathcal{Y} \}, \tag{21.37}$$

where $\mu_{A \times B}(z) = \mu_A(x) \wedge \mu_B(y)$, and $\wedge$ denotes the t-norm operator. 


Definition 21.30 (Fuzzy relation). If $R$ is a subset of $\mathcal{X} \times \mathcal{Y}$, then $R$ is said 
to be a relation between $\mathcal{X}$ and $\mathcal{Y}$, or a relation on $\mathcal{X} \times \mathcal{Y}$. Mathematically, 

$$R(x, y) = \{ ((x, y), \mu_R(x, y)) \mid (x, y) \in \mathcal{X} \times \mathcal{Y},\ \mu_R(x, y) \in [0,1] \}, \tag{21.38}$$

where $\mu_R(x, y)$ is the degree of membership for association between $x$ and $y$. 

A fuzzy relation is a fuzzy set defined on Cartesian products. A fuzzy relation 
is used to describe the association between two things. 

The height of the fuzzy relation $R(x, y)$ is given by 

$$h(R(x, y)) = \max_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}} \mu_R(x, y). \tag{21.39}$$


Definition 21.31 (Fuzzy matrix). Given finite, discrete fuzzy sets $\mathcal{X} = 
\{x_1, x_2, \ldots, x_m\}$ and $\mathcal{Y} = \{y_1, y_2, \ldots, y_n\}$, a fuzzy relation on $\mathcal{X} \times \mathcal{Y}$ can be 
represented by an $m \times n$ matrix called a fuzzy matrix, $R = [\mu_R(x_i, y_j)]$. 


A fuzzy relation can also be represented by 

$$R(x, y) = \sum_{i,j} \mu_R(x_i, y_j) / (x_i, y_j). \tag{21.40}$$

The inverse relation of $R(x, y)$, denoted by $R^{-1}(x, y)$, can be represented as the 
transpose of a membership matrix, and $(R^{-1}(x, y))^{-1} = R(x, y)$. 


Definition 21.32 (Fuzzy graph). A fuzzy relation $R(x, y)$ can be represented 
by a fuzzy graph. In a fuzzy graph, all $x_i$ and $y_j$ are vertices, and the grade 
$\mu_R(x_i, y_j)$ is added to the connection from $x_i$ to $y_j$. 


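Definition 21.33 (Aggregation of fuzzy relations). Consider two fuzzy 
relations, $R_1$ on $\mathcal{X} \times \mathcal{Y}$ and $R_2$ on $\mathcal{Y} \times \mathcal{Z}$, 

$$\begin{aligned} R_1(x, y) &= \{ ((x, y), \mu_{R_1}(x, y)) \mid (x, y) \in \mathcal{X} \times \mathcal{Y},\ \mu_{R_1}(x, y) \in [0,1] \}, \\ R_2(y, z) &= \{ ((y, z), \mu_{R_2}(y, z)) \mid (y, z) \in \mathcal{Y} \times \mathcal{Z},\ \mu_{R_2}(y, z) \in [0,1] \}. \end{aligned} \tag{21.41}$$

The max-min composition of $R_1$ and $R_2$, denoted by $R_1 \circ R_2$ with membership 
function $\mu_{R_1 \circ R_2}$, is given by a fuzzy relation on $\mathcal{X} \times \mathcal{Z}$ 

$$R_1 \circ R_2 = \left\{ \left( (x, z), \max_{y} \left\{ \min \left( \mu_{R_1}(x, y), \mu_{R_2}(y, z) \right) \right\} \right) \,\middle|\, (x, z) \in \mathcal{X} \times \mathcal{Z},\ y \in \mathcal{Y} \right\}. \tag{21.42}$$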
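With fuzzy matrices, the max-min composition resembles a matrix product in which multiplication is replaced by min and summation by max. The following sketch uses NumPy broadcasting; the two matrices are hypothetical examples.

```python
import numpy as np

def max_min(R1, R2):
    """Max-min composition: entry (i, k) = max over j of min(R1[i, j], R2[j, k])."""
    # R1[:, :, None] has shape (m, n, 1); R2[None, :, :] has shape (1, n, p).
    return np.max(np.minimum(R1[:, :, None], R2[None, :, :]), axis=1)

# Hypothetical fuzzy matrices: R1 on X x Y (2 x 2), R2 on Y x Z (2 x 3).
R1 = np.array([[0.2, 0.9],
               [0.6, 0.4]])
R2 = np.array([[0.5, 1.0, 0.3],
               [0.7, 0.1, 0.8]])

print(max_min(R1, R2))
```

For example, the first entry is max(min(0.2, 0.5), min(0.9, 0.7)) = 0.7. The inverse relation is obtained by transposing the matrix, `R1.T`.
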


Aggregation or composition operations on fuzzy sets provide a means for combining 
several sets in order to produce a single fuzzy set. In the definition, the 
aggregation operators max and min correspond to t-conorm and t-norm, respectively. 

There are some other composition operations, such as the min-max composition, 
denoted by $R_1 \,\square\, R_2$, with the difference that the roles of max and min are 
interchanged. The two compositions are related by $\overline{R_1 \,\square\, R_2} = \bar{R}_1 \circ \bar{R}_2$. 


21.6 Fuzzy implication 


Classical modus ponens is “If A then B”, which can be read “If proposition A is 
true, then infer that proposition B is true”. Modus ponens itself is a proposition, 
sometimes written as “A implies B” or “A → B”, where implies is a logical 
operator with A and B as operands. 

The implication operator is indispensable in the inference mechanisms of any 
logic, like modus ponens, modus tollens, and hypothetical syllogism in classical 
logic. Fuzzy implication is one of the key operations in fuzzy logic and approximate 
reasoning. It is used for the management of fuzzy conditionals of the type 
“If A then B”. Fuzzy implication operators are usually functionally expressed 
through numerical functions $I: [0,1] \times [0,1] \to [0,1]$ called implication functions 
or simply implications. 

Fuzzy implication $A \to B$ interprets the fuzzy rule: “IF x is A THEN y is 
B”. It is a mapping $I$ of an input fuzzy region $A$ onto an output fuzzy region 
$B$ according to the defined fuzzy relation $R$ on $A \times B$: $\mu_R(x, y) = I(x, y)$. For a 
fuzzy rule expressed as a fuzzy implication using the defined fuzzy relation $R$, 
the output linguistic variable $B$ is denoted by 

$$B = A \circ R, \tag{21.43}$$

which is characterized by $\mu_B(y) = \bigvee_x \left( \mu_A(x) \wedge \mu_R(x, y) \right)$. 


Definition 21.34. A fuzzy implication is a function $I: [0,1]^2 \to [0,1]$ that 
satisfies the following properties: 

• Monotonicity: $x_1 \le x_2 \Rightarrow I(x_1, y) \ge I(x_2, y)$. 
• Monotonicity: $y_1 \le y_2 \Rightarrow I(x, y_1) \le I(x, y_2)$. 
• Dominance of falsity: $I(0, y) = 1$. 
• Neutrality of truth: $I(1, y) = y$. 
• Exchange: $I(x_1, I(x_2, y)) = I(x_2, I(x_1, y))$. 


It is obvious that $I(0,0) = I(0,1) = I(1,1) = 1$, $I(1,0) = 0$, and $I(a,a) = 1$, 
which satisfy the properties of classical implication. A fuzzy rule “IF x is A 
THEN y is B” is expressed as $I(a, b)$, where $a$ and $b$ are the membership grades 
of $A$ and $B$, respectively. 

Since conjunctions, disjunctions and negations are usually performed by 
t-norms, t-conorms and strong negations, the majority of the known implication 
functions are directly derived from these operators. Several well-known implication 
operators so defined do not even satisfy the definition. 

The four most usual definitions for implications are S-, R-, QL- and 
D-implications [22]. They are equivalent in any Boolean algebra and consequently 
in classical logic. However, in fuzzy logic they yield distinct classes of fuzzy 
implications. The two classes most commonly used are R- and S-implications. 

S-implications are defined by 

$$I(x, y) = C(N(x), y), \quad x, y \in [0,1], \tag{21.44}$$

where $C$ is a t-conorm and $N$ is a strong negation. They appear as an immediate 
generalization of the classical Boolean implication $p \to q \equiv \neg p \vee q$. 

R-implications are defined by 

$$I(x, y) = \sup \{ z \in [0,1] \mid T(x, z) \le y \}, \quad x, y \in [0,1], \tag{21.45}$$

where $T$ is a left-continuous t-norm. They come from residuated lattices based 
on the residuation property that in the case of t-norms can be written as 

$$T(x, y) \le z \iff I(x, z) \ge y, \quad \forall x, y, z \in [0,1]. \tag{21.46}$$


The majority of known implications not only satisfy the definition but also 
belong to some of the four types. For instance, the Łukasiewicz implication 
($I(x, y) = \min(1, 1 - x + y)$) belongs to all four types. The Kleene-Dienes implication 
($I(x, y) = \max(1 - x, y)$) is an S-implication derived from the t-conorm 
maximum and the negation $N(x) = 1 - x$. 
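The two named implications and the residuum construction behind R-implications can be sketched as follows. The grid search that approximates the supremum is an illustrative device with an assumed resolution, and pairing the Łukasiewicz implication with the t-norm max(0, x + z − 1) is the standard choice rather than something stated above.

```python
def lukasiewicz(x, y):
    return min(1.0, 1.0 - x + y)

def kleene_dienes(x, y):
    return max(1.0 - x, y)

def t_bounded(x, z):
    return max(0.0, x + z - 1.0)

def r_implication(t_norm, x, y, steps=1000):
    """Approximate sup{z in [0, 1] : T(x, z) <= y} on a uniform grid."""
    candidates = (i / steps for i in range(steps + 1))
    # z = 0 always qualifies because T(x, 0) = 0 <= y, so max() is never empty.
    return max(z for z in candidates if t_norm(x, z) <= y + 1e-12)

# The residuum of the bounded t-norm agrees with the Lukasiewicz implication.
print(r_implication(t_bounded, 0.7, 0.4), lukasiewicz(0.7, 0.4))

# Boundary behavior required by the definition: I(0, y) = 1 and I(1, y) = y.
for impl in (lukasiewicz, kleene_dienes):
    print(impl(0.0, 0.3), impl(1.0, 0.3))
```
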

The implication functions are used not only to represent IF-THEN statements 
but also to perform forward and backward inferences in fuzzy systems, with 
the two main classical inference rules (deduction rules) of modus ponens and 
modus tollens, respectively. The choice of fuzzy implication cannot be made 
independently of the inference rule that is going to be applied. 

QL-implication, known as propositional calculus, is based on the classical 
logic form $\neg(a \wedge \neg(a \wedge b)) \equiv \neg a \vee (a \wedge b)$ and logical operators are substituted 
by fuzzy operators. S-implication, called material implication, derives from the 
classical logic form $a \to b \equiv \neg a \vee b$. R-implication and D-implication reflect a 
partial ordering on propositions and are based on a generalization of modus 
ponens and modus tollens, respectively. 


21.7 Reasoning and fuzzy reasoning 


Reasoning processes are categorized into deductive and reductive types. Deductive 
reasoning proceeds with inference from premises to a conclusion: If premises 
$P$ and $P \to Q$, then the conclusion $Q$. Reductive reasoning carries inference from 
conclusions to a set of plausible premises: If $P \to Q$ and $Q$, then $P$. Whereas 
deductive reasoning is exact, reductive reasoning is more intuitive. Inductive 
reasoning, as a special type of reductive reasoning, generalizes evidence to a 
hypothesis for a population. Inductive reasoning has no more justification than 
random guessing, and this can be drawn from the no free lunch theorem. 

Logics as bases for reasoning can be distinguished essentially by three items: 
truth values, vocabulary (operators) and reasoning procedures (tautologies, 
syllogisms). A formal logical system largely consists of an axiom system (knowledge 
base) and an inference system. The axiom system consists of a set of axiom 
schemes, and the inference system consists of a set of rules of inference. Two 
popular inference rules in mathematical logic, mathematics and AI are modus 
ponens and modus tollens. Modus ponens and modus tollens belong to rules of 
deduction. 

Fuzzy reasoning, also called approximate reasoning, is an inference procedure 
for deriving conclusions from a set of fuzzy rules and one or more conditions. 
Fuzzy reasoning employs the generalized fuzzy modus ponens. The compositional 
rule of inference is the essential rationale behind fuzzy reasoning. 


21.7.1 Modus ponens and modus tollens 


Modus ponens 
Modus ponens is an important tool in classical logic for inferring one proposition 
from another, and has been used for roughly 2000 years. 

Consider modus ponens as one tautology: 

$$(P \wedge (P \to Q)) \to Q. \tag{21.47}$$

Modus ponens has a general form 

$$\begin{array}{ll} P \to Q & \text{Implication: If } P \text{ then } Q. \\ P & \text{Premise: } P \text{ is true.} \\ \hline Q & \text{Conclusion: } Q \text{ is true.} \end{array}$$

Thus, if $P$ is true and $P \to Q$ is true, then the conclusion $Q$ is also true. 


Modus tollens 
Modus tollens is given by 

$$\begin{array}{l} P \to Q \\ \neg Q \\ \hline \neg P \end{array}$$

Thus, if $Q$ is false and $P \to Q$ is true, then the conclusion $\neg P$ is true. From an 
experience-based reasoning viewpoint, this is a formalized summary of experience. 
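Both rules can be confirmed to be tautologies by enumerating the four truth assignments. The sketch below also tests the reductive pattern "from P → Q and Q, infer P", which fails for one assignment and is therefore not a rule of deduction.

```python
from itertools import product

def implies(p, q):
    """Classical material implication: (not p) or q."""
    return (not p) or q

assignments = list(product([False, True], repeat=2))

# (P and (P -> Q)) -> Q
modus_ponens = all(implies(p and implies(p, q), q) for p, q in assignments)
# ((not Q) and (P -> Q)) -> (not P)
modus_tollens = all(implies((not q) and implies(p, q), not p) for p, q in assignments)
# ((P -> Q) and Q) -> P: the reductive pattern, fails for P false and Q true.
reductive = all(implies(implies(p, q) and q, p) for p, q in assignments)

print(modus_ponens, modus_tollens, reductive)
```
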


21.7.2 Generalized modus ponens 


The truth value of a proposition is a measure in the interval [0, 1] of how sure we 
are that the proposition is true. Data have truth values, a measure of the extent 
to which the values of the data are valid; rules have truth values, a measure of the 
extent to which the rule itself is valid. In general, the truth value of something 
is a measure in [0,1] of its validity. 




A basic truth-functional operator used in fuzzy logic is logical implication, 
written $P \to Q$. It may be expressed in terms of or and not as 

$$(P \to Q) \equiv ((\neg P) \vee Q). \tag{21.48}$$

Generalized modus ponens is given by 

$$\begin{array}{l} P \to Q \\ P' \\ \hline Q' \end{array}$$

where $P'$ and $Q'$ correspond to the two compound propositions $P$ and $Q$. 
For example, 


Implication: If a tomato is red, then the tomato is ripe. 
Premise: This tomato is very red. 
Conclusion: This tomato is very ripe. 


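For composition of propositions, one uses fuzzy connectives including negation 
($\neg$), conjunction ($\wedge$), disjunction ($\vee$), implication ($\to$) and equivalence ($\leftrightarrow$). Let 
$P(x)$ and $Q(y)$ be two fuzzy propositions which have the truth degrees $\mu_P(x)$ and 
$\mu_Q(y)$, respectively, with $x \in R_P$ and $y \in R_Q$. The degrees of truth yielded by 
these fuzzy connectives are defined by 

$$\mu_{P \to Q}(x, y) = \max \left( (\mu_P(x) \le \mu_Q(y)),\ \mu_Q(y) \right) \quad \text{(Implication)}, \tag{21.49}$$

$$\mu_{P \leftrightarrow Q}(x, y) = \max \left( (\mu_P(x) == \mu_Q(y)),\ \min(\mu_P(x), \mu_Q(y)) \right) \quad \text{(Equivalence)}. \tag{21.50}$$

Here a comparison in parentheses evaluates to 1 if it holds and to 0 otherwise. 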
In fuzzy logic, after interpreting algebraically in terms of truth values, generalized 
modus ponens becomes: “The truth value of $P \wedge (P \to Q)$ must be less than 
or equal to the truth value of $Q$,” which can be expressed as 

$$T_1(x, I(x, y)) \le y, \quad \forall x, y \in [0,1], \tag{21.51}$$

where $T_1$ is a t-norm performing conjunction and $I$ is an implication function 
performing the conditional relation. When $I$ satisfies (21.51) with respect to the 
t-norm $T_1$, it is said that $I$ is $T_1$-conditional. It is known that the R-implication 
derived from a left-continuous t-norm $T_1$ is always $T_1$-conditional (in fact, it is 
the greatest $T_1$-conditional implication). 
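Whether an implication is conditional with respect to a given t-norm can be probed on a grid. The sketch below uses hypothetical helper names and tests two pairings: the Łukasiewicz implication with the t-norm max(0, x + y − 1), from which it arises as a residuum, and the Kleene-Dienes implication with the minimum t-norm, which violates the inequality (for instance at x = 0.5, y = 0).

```python
def t_min(x, y):
    return min(x, y)

def t_bounded(x, y):
    return max(0.0, x + y - 1.0)

def lukasiewicz(x, y):
    return min(1.0, 1.0 - x + y)

def kleene_dienes(x, y):
    return max(1.0 - x, y)

grid = [i / 20 for i in range(21)]

def is_conditional(t_norm, impl, tol=1e-12):
    """Check T(x, I(x, y)) <= y for all grid points (a numerical check, not a proof)."""
    return all(t_norm(x, impl(x, y)) <= y + tol for x in grid for y in grid)

print(is_conditional(t_bounded, lukasiewicz))   # residuum pairing
print(is_conditional(t_min, kleene_dienes))     # fails the inequality
```
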


21.7.3 Fuzzy reasoning methods 

Generally, fuzzy systems can be divided into three categories: 

• Takagi-Sugeno reasoning: consequents are functions of inputs. 
• Mamdani-type reasoning: consequents and antecedents are related by the 
min operator or generally by t-norm. 
• Logical-type reasoning: consequents and antecedents are related by fuzzy 
implications. 


Consider a fuzzy set $A = \{ (x, \mu_A(x)) \mid x \in \mathcal{X} \}$ and a fuzzy relation $R$ on $A \times B$, 
$R(x, y) = \{ ((x, y), \mu_R(x, y)) \mid (x, y) \in \mathcal{X} \times \mathcal{Y} \}$. Fuzzy set $B$ can be inferred from 
$A$ and $R$ according to the max-min composition 

$$B = A \circ R = \left\{ \left( y, \max_{x} \left\{ \min \left( \mu_A(x), \mu_R(x, y) \right) \right\} \right) \,\middle|\, x \in \mathcal{X},\ y \in \mathcal{Y} \right\}. \tag{21.52}$$

Generalized modus ponens can be formulated as: “If x is A then y is B. From 
x = A′, infer that y = B′”. Here $A$ and $A'$ are fuzzy sets defined on the same 
universe, and $B$ and $B'$ are also fuzzy sets defined on the same universe, which 
may be different from the universe on which $A$ and $A'$ are defined. By computing 
the fuzzy conclusion $B'$ using the compositional rule of inference $B' = A' \circ R$, we 
have 

$$B' = A' \circ (A \to B). \tag{21.53}$$


The inverse problem of approximate reasoning is to conclude A’ from B’ and 
A — B. By using the law of contrapositive symmetry, similarity-based inverse 
approximate reasoning [26] provides a solution to the problem. 


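Example 21.1: Assume that 

$$A = \left\{ \frac{0.3}{x_1}, \frac{0.7}{x_2}, \frac{1.0}{x_3} \right\}, \quad B = \left\{ \frac{0.5}{y_1}, \frac{1.0}{y_2}, \frac{0.6}{y_3} \right\}, \quad A' = \left\{ \frac{1.0}{x_1}, \frac{0.6}{x_2}, \frac{0.3}{x_3} \right\}.$$

By choosing the Łukasiewicz implication operator $I(x, y) = \min(1, 1 - x + y)$, 
we obtain $R(x, y)$ of $A \to B$, with entries $\mu_R(x_i, y_j) = \min(1, 1 - \mu_A(x_i) + \mu_B(y_j))$, as 

$$R = \begin{bmatrix} 1 & 1 & 1 \\ 0.8 & 1 & 0.9 \\ 0.5 & 1 & 0.6 \end{bmatrix}.$$

We are now ready to obtain $B'$ from $A' \circ R$. Using $T = T_m$, we get 

$$B'(y_1) = \max(\min(1, 1), \min(0.6, 0.8), \min(0.3, 0.5)) = 1,$$
$$B'(y_2) = \max(\min(1, 1), \min(0.6, 1), \min(0.3, 1)) = 1,$$
$$B'(y_3) = \max(\min(1, 1), \min(0.6, 0.9), \min(0.3, 0.6)) = 1,$$

and 

$$B' = \left\{ \frac{1}{y_1}, \frac{1}{y_2}, \frac{1}{y_3} \right\}.$$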
Figure 21.6 The architecture of a fuzzy controller. (Central block: fuzzy inference system.) 


21.8 Fuzzy inference systems 


In control systems, the inputs to the systems are the error and the change in the 
error of the feedback loop, while the output is the control action. Fuzzy logic 
based controllers are popular control systems. The general architecture of a fuzzy 
controller is depicted in Fig. 21.6. Fuzzy controllers are knowledge-based, where 
knowledge is defined by fuzzy IF-THEN rules. The core of a fuzzy controller is a 
fuzzy inference system, in which the data flow involves fuzzification, knowledge- 
base evaluation, and defuzzification. A fuzzy inference system is also termed a 
fuzzy expert system or a fuzzy model. 

In a fuzzy inference system, the knowledge base comprises the fuzzy rule 
base and the database. The database contains the linguistic term sets considered 
in the linguistic rules, the membership functions defining the semantics of 
the linguistic variables, and information about domains. The rule base contains 
a collection of linguistic rules that are joined by the also operator. An expert 
provides knowledge in the form of linguistic rules. The fuzzification process 
collects the inputs and then converts them into linguistic values or fuzzy sets. 
The decision logic, called the fuzzy inference engine, generates output from the 
input, and finally the defuzzification process produces a crisp output for control action. 

Fuzzy inference systems are universal approximators capable of performing 
nonlinear mappings between inputs and outputs. The interpretations of a certain 
rule and the rule base depend on the fuzzy system model. The Mamdani [20] and 
TSK [36] models are two popular fuzzy inference systems. The Mamdani model 
is a nonadditive fuzzy model that aggregates the output of fuzzy rules using 
the maximum operator, while the TSK model is an additive fuzzy model that 
aggregates the output of rules using the addition operator. Kosko’s standard 
additive model [18] is another additive fuzzy model. All these models can be 
derived from the fuzzy graph [43], and are universal approximators [17, 41, 4, 5]. 

Both neural networks and fuzzy logic can be used to approximate an unknown 
control function. Neural networks achieve a solution using the learning process, 
while fuzzy inference systems apply a vague interpolation technique. Fuzzy 
inference systems are appropriate for modeling nonlinear systems whose 
mathematical models are not available. Unlike neural networks and other numerical 
models, fuzzy models operate at a level of information granules, namely fuzzy sets. 


Fuzzy rules and fuzzy inference 


There are two types of fuzzy rules, namely fuzzy mapping rules and fuzzy implica- 
tion rules [43]. A fuzzy mapping rule describes a functional mapping relationship 
between inputs and an output using linguistic terms, while a fuzzy implication 
rule describes a generalized logic implication relationship between two logic for- 
mulas involving linguistic variables. Fuzzy implication rules generalize set-to-set 
implications, whereas fuzzy mapping rules generalize set-to-set associations. The 
former was motivated to allow intelligent systems to draw plausible conclusions 
in a way similar to human reasoning, while the latter was motivated to approx- 
imate complex relationships such as nonlinear functions in a cost-effective and 
easily comprehensible way. The foundation of the fuzzy mapping rule is the fuzzy 
graph, while the foundation of the fuzzy implication rule is a generalization of 
two-valued logic. 

A rule base consists of a number of rules in IF-THEN logic: “IF condition, 
THEN action.” The condition, also called premise, is made up of a number of 
antecedents that are negated or combined by different operators such as "and" or 
"or", computed with t-norms or t-conorms. In a fuzzy-rule system, some variables 
are linguistic variables and the determination of the membership function for 
each fuzzy subset is critical. Membership functions can be selected according to 
human intuition, or by learning from training data. 

A fuzzy inference is made up of several rules with the same output variables. 
Given a set of fuzzy rules, the inference result is a combination of the fuzzy 
values of the conditions and the corresponding actions. For example, we have a 
set of $N_r$ rules 


$R_i$: IF (condition = $C_i$) THEN (action = $A_i$), $i = 1, \ldots, N_r$, 


where $C_i$ and $A_i$ are fuzzy sets. Assume that a condition has a membership 
degree of $\mu_i$ associated with $C_i$. The condition is first converted into a fuzzy 
category using a syntactical representation 


$$\text{condition} = \sum_{i=1}^{N_r} \frac{C_i}{\mu_i} = \frac{C_1}{\mu_1} + \frac{C_2}{\mu_2} + \cdots + \frac{C_{N_r}}{\mu_{N_r}}. \qquad (21.54)$$


Notice the difference from the definition of a finite fuzzy set in (21.2). We can see 
that each rule is valid to a certain extent. A fuzzy inference is the combination 
of all the possible consequences. The action coming from a fuzzy inference is also 
a fuzzy category, which can be syntactically represented by action $= \sum_{i=1}^{N_r} A_i / \mu_i$. 
The inference procedure depends on fuzzy reasoning. This result can be further 
processed or transformed into a crisp value. 
Fuzzification and defuzzification 


Fuzzification is to transform crisp inputs into fuzzy subsets. Given crisp inputs 
$x_i$, $i = 1, \ldots, n$, fuzzification is to construct the same number of fuzzy sets $A^i$, 


$$A^i = \mathrm{fuzz}(x_i), \qquad (21.55)$$


where $\mathrm{fuzz}(\cdot)$ is a fuzzification operator. Fuzzification is determined according to 
the defined membership functions. 

Defuzzification maps fuzzy subsets of real numbers into real numbers. It is 
applied after aggregation. Defuzzification is necessary in fuzzy controllers, since 
the machines cannot understand control signals in the form of a complete fuzzy 
set. Popular defuzzification methods include the centroid defuzzifier [20] and the 
mean-of-maxima defuzzifier [20]. 

The best-known centroid defuzzifier finds the centroid of the area surrounded 
by the membership function and the horizontal axis. A discrete centroid defuzzi- 
fier is given by 


$$\mathrm{defuzz}(B) = \frac{\sum_{i=1}^{K} \mu_B(y_i)\, y_i}{\sum_{i=1}^{K} \mu_B(y_i)}, \qquad (21.56)$$

where $K$ is the number of quantization steps by which the universe of discourse 
$Y$ of the membership function $\mu_B(y)$ is discretized. 
Aggregation and defuzzification can be combined into a single phase, such as 
the weighted-mean method [11] 


$$\mathrm{defuzz}(B) = \frac{\sum_{i=1}^{N_r} \mu_i b_i}{\sum_{i=1}^{N_r} \mu_i}, \qquad (21.57)$$

where $N_r$ is the number of rules, $\mu_i$ is the degree of activation of the $i$th rule, and 
$b_i$ is a numerical value associated with the consequent of the $i$th rule, $B_i$. The 
parameter $b_i$ can be selected as the mean value of the $\alpha$-level set when $\alpha = \mu_i$ 
[11]. 
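
Both defuzzifiers reduce to weighted averages and are easy to sketch in code. The following Python fragment is an illustration only: the function names, the triangular output set and the two-rule activation values are assumptions chosen for the example, not part of the text.

```python
def centroid_defuzz(ys, mu):
    """Discrete centroid defuzzifier (21.56): sum(mu_i * y_i) / sum(mu_i)."""
    total = sum(mu)
    if total == 0:
        raise ValueError("membership function is zero on all sample points")
    return sum(m * y for m, y in zip(mu, ys)) / total


def weighted_mean_defuzz(activations, b):
    """Weighted-mean method (21.57): sum(mu_i * b_i) / sum(mu_i) over rules."""
    total = sum(activations)
    if total == 0:
        raise ValueError("no rule is activated")
    return sum(m * bi for m, bi in zip(activations, b)) / total


# Hypothetical output fuzzy set: a triangle peaked at y = 5, sampled at K = 11 points.
ys = list(range(11))
mu = [max(0.0, 1.0 - abs(y - 5) / 5.0) for y in ys]
crisp_centroid = centroid_defuzz(ys, mu)

# Hypothetical rule activations and consequent values for two rules.
crisp_weighted = weighted_mean_defuzz([0.2, 0.8], [10.0, 20.0])

print(crisp_centroid, crisp_weighted)
```

For the symmetric triangle the centroid coincides with the peak, while the weighted mean is pulled toward the consequent of the more strongly activated rule.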


Fuzzy models 


Given a set of $N$ examples $\{(\mathbf{x}_p, \mathbf{y}_p) \mid \mathbf{x}_p \in R^n, \mathbf{y}_p \in R^m\}$, the underlying system 
can be identified by using some fuzzy models. Two popular fuzzy inference system 
models are the Mamdani and TSK models. 


Mamdani model 


For the Mamdani model with $N_r$ rules, the $i$th rule is given by 


$R_i$: IF $\mathbf{x}$ is $A_i$, THEN $\mathbf{y}$ is $B_i$, $i = 1, \ldots, N_r$, 

where $A_i = \{A_i^1, A_i^2, \ldots, A_i^n\}$, $B_i = \{B_i^1, B_i^2, \ldots, B_i^m\}$, and $A_i^j$, $B_i^k$ are, 
respectively, fuzzy sets that define an input and output space partitioning. 
For an $n$-tuple input in the form of “$\mathbf{x}$ is $A'$”, the output “$\mathbf{y}$ is $B'$” is 
characterized by combining the rules according to 

$$\mu_{B'}(\mathbf{y}) = \bigvee_{i=1}^{N_r} \left( \mu_{A'_i}(\mathbf{x}) \wedge \mu_{B_i}(\mathbf{y}) \right), \qquad (21.58)$$

where $A' = \{A'^1, A'^2, \ldots, A'^n\}$, $B' = \{B'^1, B'^2, \ldots, B'^m\}$, and $A'^j$, $B'^k$ are, 
respectively, fuzzy sets that define an input and output space partitioning, 


$$\mu_{A'_i}(\mathbf{x}) = \mu_{A'}(\mathbf{x}) \wedge \mu_{A_i}(\mathbf{x}) = \bigwedge_{j=1}^{n} \left( \mu_{A'^j} \wedge \mu_{A_i^j} \right), \qquad (21.59)$$


$\mu_{A'}(\mathbf{x}) = \bigwedge_{j=1}^{n} \mu_{A'^j}$ and $\mu_{A_i}(\mathbf{x}) = \bigwedge_{j=1}^{n} \mu_{A_i^j}$ being, respectively, the membership 
degrees of $\mathbf{x}$ to the fuzzy sets $A'$ and $A_i$, $\mu_{B_i}(\mathbf{y}) = \bigwedge_{k=1}^{m} \mu_{B_i^k}$ is the membership 
degree of $\mathbf{y}$ to the fuzzy set $B_i$, $\mu_{A_i^j}$ is the association between the $j$th input of 
$A'$ and the $i$th rule, $\mu_{B_i^k}$ is the association between the $k$th input of $B$ and the 
$i$th rule, $\wedge$ is the intersection operator, and $\vee$ is the union operator. 

Minimum and product are the most common intersection operators. When 
minimum and maximum are, respectively, used as the intersection and union 
operators, the Mamdani model is called a max-min model. Kosko’s standard 
additive model [18] has the same rule form, but it uses the product operator 
for the fuzzy intersection operation, and sup-product as well as addition as the 
composition operators. 

We now illustrate the inference procedure for the Mamdani model. Assume 
that we have a two-rule Mamdani system with the rules of the form 


$R_i$: IF $x_1$ is $A_i$ and $x_2$ is $B_i$, THEN $y$ is $C_i$, for $i = 1, 2$. 


When the max-min composition is employed, for the inputs “$x_1$ is $A'$” and “$x_2$ 
is $B'$”, the fuzzy reasoning procedure for the output $y$ is illustrated in Fig. 21.7. 
When two crisp inputs $x_1'$ and $x_2'$ are fed, the derivation of the output $y'$ is 
illustrated in Fig. 21.8. As a comparison with Fig. 21.7, Fig. 21.9 illustrates 
the result when the max-product composition is used to replace the max-min 
composition. A defuzzification strategy is needed to get a crisp output value. 
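
The crisp-input case with the max-min composition can be sketched as follows. This is a hedged illustration rather than the procedure of any particular figure: the triangular membership functions, their parameters, the helper `tri` and the sampling grid for $y$ are all assumptions made for the example. A discrete centroid defuzzifier is applied at the end.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a and c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical fuzzy sets for the rules R_i: IF x1 is A_i and x2 is B_i, THEN y is C_i.
A = [(0, 2, 4), (2, 4, 6)]
B = [(0, 3, 6), (3, 6, 9)]
C = [(0, 2, 4), (3, 5, 7)]


def mamdani(x1, x2, ys):
    """Two-rule max-min Mamdani inference for crisp inputs x1 and x2."""
    # Firing strength of each rule: minimum over its antecedents.
    w = [min(tri(x1, *A[i]), tri(x2, *B[i])) for i in range(2)]
    # Clip each consequent at its firing strength, then aggregate with max.
    mu_out = [max(min(w[i], tri(y, *C[i])) for i in range(2)) for y in ys]
    # Discrete centroid defuzzification of the aggregated output set.
    total = sum(mu_out)
    y_crisp = sum(m * y for m, y in zip(mu_out, ys)) / total if total > 0 else None
    return w, mu_out, y_crisp


ys = [k * 0.1 for k in range(71)]  # sample points covering [0, 7]
w, mu_out, y_crisp = mamdani(3.0, 4.0, ys)
print(w, y_crisp)
```

Replacing the inner `min(w[i], ...)` by `w[i] * ...` scales the consequents instead of clipping them, which corresponds to the max-product composition.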

The Mamdani model offers a high semantic level and a good generalization 
capability. It contains fuzzy rules built from expert knowledge. However, fuzzy 
inference systems based only on expert knowledge may result in insufficient accu- 
racy. For accurate numerical approximation, the TSK model can usually generate 
better performance. 
Figure 21.7 The inference procedure of the Mamdani model with the max-min composition and fuzzy 
inputs. 
Figure 21.8 The inference procedure of the Mamdani model with the max-min composition and crisp 
inputs. 
Figure 21.9 The inference procedure of the Mamdani model with the max-product composition and 
fuzzy inputs. 
Figure 21.10 The inference procedure for the TSK model with the min or product operator. 


21.9.2 Takagi-Sugeno-Kang model 


In the TSK model [36], for the same set of examples $\{(\mathbf{x}_p, \mathbf{y}_p)\}$, fuzzy rules are 
given in the form 


$R_i$: IF $\mathbf{x}$ is $A_i$, THEN $y = f_i(\mathbf{x})$, $i = 1, 2, \ldots, N_r$, 


where $f_i(\mathbf{x})$ is typically selected as a linear relation of $\mathbf{x}$ 

$$f_i(\mathbf{x}) = a_{i,0} + a_{i,1} x_1 + \cdots + a_{i,n} x_n, \qquad (21.60)$$


with $a_{i,k}$, $k = 0, 1, \ldots, n$, being adjustable parameters. 
For an $n$-tuple input in the form of “$\mathbf{x}$ is $A'$”, the output $y'$ is obtained by 
combining the rules according to 


$$y' = \frac{\sum_{i=1}^{N_r} \mu_{A'_i}(\mathbf{x})\, f_i(\mathbf{x})}{\sum_{i=1}^{N_r} \mu_{A'_i}(\mathbf{x})}, \qquad (21.61)$$


where $\mu_{A'_i}(\mathbf{x})$ is defined by (21.59). This model produces a real-valued function, 
and it is essentially a model-based fuzzy control method. The stability analysis 
of the TSK model is given in [38]. When $f_i(\cdot)$ are first-order polynomials, the 
model is termed the first-order TSK model, which is the typical form of the TSK 
model. When $f_i$ are constants, it is called the zero-order TSK model, which 
can be viewed as a special case of the Mamdani model. 

Similarly, we illustrate the inference procedure of the TSK model. Given a 
two-rule TSK fuzzy inference system with the rules of the form 


$R_i$: IF $x_1$ is $A_i$ and $x_2$ is $B_i$, THEN $y = f_i(x_1, x_2)$, for $i = 1, 2$. 

When two crisp inputs $x_1'$ and $x_2'$ are fed, the inference for the output $y'$ is as 
illustrated in Fig. 21.10. 
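
A two-rule first-order TSK system with crisp inputs can be sketched in a few lines. The Gaussian membership functions, their centers and widths, and the consequent coefficients below are hypothetical values chosen for illustration; the firing strength uses either the min or the product operator, and the output is the weighted average of (21.61).

```python
import math
import operator


def gauss(x, c, s):
    """Gaussian membership function with center c and width s."""
    return math.exp(-0.5 * ((x - c) / s) ** 2)


# Hypothetical rules: (A_i parameters, B_i parameters, (a_i0, a_i1, a_i2)).
rules = [
    ((0.0, 1.0), (0.0, 1.0), (1.0, 2.0, 0.0)),  # f_1 = 1 + 2*x1
    ((2.0, 1.0), (2.0, 1.0), (0.0, 0.0, 5.0)),  # f_2 = 5*x2
]


def tsk(x1, x2, t_norm=min):
    """First-order TSK output: firing-strength-weighted mean of the f_i."""
    num = 0.0
    den = 0.0
    for (c1, s1), (c2, s2), (a0, a1, a2) in rules:
        w = t_norm(gauss(x1, c1, s1), gauss(x2, c2, s2))  # firing strength
        f = a0 + a1 * x1 + a2 * x2                        # linear consequent
        num += w * f
        den += w
    return num / den


y_min = tsk(1.0, 1.0)                        # min operator
y_prod = tsk(1.0, 1.0, t_norm=operator.mul)  # product operator
print(y_min, y_prod)
```

At the midpoint between the two rule centers both rules fire equally, so the output is the plain average of the two consequents regardless of the operator used.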


In comparison with the Mamdani model, the TSK model, which is based on 
automatic learning from the data, can accurately approximate a function using 
fewer rules. It has a stronger and more flexible representation capability than 
the Mamdani model. In the TSK model, rules are extracted from the data, but 
the generated rules may be meaningless to experts. The TSK model has found 
more successful applications in building fuzzy systems. 


Complex fuzzy logic 


Complex fuzzy sets and logic are mathematical extensions of fuzzy sets and logic 
from the real domain to the complex domain [35, 34]. A complex fuzzy set S is 
characterized by a complex-valued membership function, and the membership 
degree of any element x in S is given by a complex value of the form 


$$\mu_S(x) = r_S(x)\, e^{j \varphi_S(x)}, \qquad (21.62)$$


where the amplitude $r_S(x) \in [0, 1]$, and $\varphi_S(x)$ is the phase; that is, $\mu_S(x)$ is within 
a unit circle in the complex plane. 

In [35, 34], basic set operators for fuzzy logic have been extended for the com- 
plex fuzzy logic, and some additional operators, such as the vector aggregation, 
set rotation and set reflection, are also defined. The operations of intersection, 
union and complement for complex fuzzy sets are defined on the modulus of the 
complex membership degree without consideration of its phase information. In 
[8], the complex fuzzy logic is extended to a logic of vectors in the plane, rather 
than scalars. 

Complex fuzzy sets are superior to the Cartesian products of two fuzzy sets. 
Complex fuzzy logic maintains both the advantages of fuzzy logic and the proper- 
ties of complex fuzzy sets. In complex fuzzy logic, rules constructed are strongly 
related and a relation manifested in the phase term is associated with complex 
fuzzy implications. In a complex fuzzy inference system, the output of each rule 
is a complex fuzzy set, and phase terms are necessary when combining multiple 
rules so as to generate the final output. 

The fuzzy complex number [3] is a different concept from the complex fuzzy set 
[35]. The fuzzy complex number was introduced by incorporating the complex 
number into the support of the fuzzy set. A fuzzy complex number is a fuzzy 
set of complex numbers, which have real-valued membership degree in the range 
[0,1]. The operations of addition, subtraction, multiplication and division for 
fuzzy complex numbers are derived using the extension principle, and closure of 
the set of fuzzy complex numbers is proved under each of these operators. To 
sum up, a fuzzy complex number is a fuzzy set in one dimension, while a complex 
fuzzy set or number is a fuzzy set in two dimensions. 
Possibility theory 


Fuzzy logic can be treated as a possibility theory. A possibility of 0.5 for element 
x may imply a 0.5 degree of evidence or belief that x belongs to a certain set. In 
possibility theory, possibility distribution is analogous to the notion of probability 
distribution in probability theory. 

Let $V$ be a variable that takes values on some element $x$ of $X$ and $A$ be a fuzzy 
set, defined by $\mu_A(x)$. The possibility of $V$ being assigned $x$, $\pi_V(x)$, is $\mu_A(x)$. 
Possibility distributions are fuzzy sets. 

Both fuzzy theory and probability theory express uncertainty, and use 
variables having values in the range of [0,1]. The membership function $\mu_A(x)$ is 
defined as a possibility distribution function for set $A$ on the universal set $X$. 
Possibility measures are softer than probability measures, and their 
interpretations are different. Probability has a frequentist interpretation: it quantifies the 
frequency of occurrence of an event. It is defined on a sample space $S$ and must 
sum up to one. Possibility has a context-dependent interpretation: it quantifies 
the meaning of an event. It is defined on a universal set $X$ but there is no limit 
for the sum. Probability is an objective measure, while possibility or fuzzy 
measures are subjective measures. 

The probability/possibility consistency principle [46] states that possibility is 
an upper bound for probability. That is, the possibility $\mu(A)$ and probability 
$P(A)$ of an event $A$ have the relation $\mu(A) \ge P(A)$. If an event is not possible, 
it is not probable. 


Example 21.2: Suppose the proposition: “Cynric has $x$ siblings, $x \in N = 
\{1, 2, 3, 4, \ldots, 8\}$”. Both probability distribution and possibility distribution can 
be used to define $x$ in $N$. The probability and possibility of having $x$ siblings 
are denoted by $P(x)$ and $\mu_N(x)$, respectively. The set $N$ is considered as a 
sample space in the probability distribution and as a universal set in the possibility 
distribution. 


x          1     2     3     4     5     6     7     8
P(x)       0.5   0.3   0.15  0.05  0     0     0     0
μ_N(x)     1.0   1.0   0.8   0.3   0.1   0     0     0

From the above table we see that the sum of the probabilities is 1 but that 
of the possibilities is greater than 1. We can see that higher possibility does not 
always mean higher probability, but lower possibility leads to lower probability. 
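
The observations about Example 21.2 can be checked numerically. In the sketch below the dictionaries `P` and `poss` hold the tabulated values; the entries for $x = 7$ and $x = 8$ are taken as zero, which is an assumption (for `P` it is forced by the listed probabilities already summing to one).

```python
# Probability and possibility of having x siblings (Example 21.2).
# Entries for x = 7, 8 are assumed to be zero.
P = {1: 0.5, 2: 0.3, 3: 0.15, 4: 0.05, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0}
poss = {1: 1.0, 2: 1.0, 3: 0.8, 4: 0.3, 5: 0.1, 6: 0.0, 7: 0.0, 8: 0.0}

sum_P = sum(P.values())        # must be one for a probability distribution
sum_poss = sum(poss.values())  # no such constraint for possibilities

# Consistency principle: possibility is an upper bound for probability.
consistent = all(poss[x] >= P[x] for x in P)

print(sum_P, sum_poss, consistent)
```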


Definition 21.35 (Crisp probability of a fuzzy event). Let event $A$ be a 
fuzzy event or a fuzzy set considered in space $R^n$, $A = \{(x, \mu_A(x)) \mid x \in R^n\}$. The 
probability for $A$ is defined by 


$$P(A) = \int_A \mu_A \, dP, \qquad (21.63)$$

and alternatively, 

$$P(A) = \sum_{x \in A} \mu_A(x) P(x). \qquad (21.64)$$


Given a fuzzy event in the sample space $S$, $A = \{(x, \mu_A(x)) \mid x \in S\}$. The $\alpha$-cut 
set of the event (set) $A$ is given as $A_\alpha = \{x \mid \mu_A(x) \ge \alpha\}$. The probability of the 
$\alpha$-cut event is given by 


$$P(A_\alpha) = \sum_{x \in A_\alpha} P(x). \qquad (21.65)$$

Here, $A_\alpha$ is the union of mutually exclusive events. The probability of $A_\alpha$ is 
the sum of the probability of each event in $A_\alpha$. For the probability of the $\alpha$-cut 
event, we can say that the possibility of the probability of $A_\alpha$ being $P(A_\alpha)$ is $\alpha$. 


Definition 21.36 (Fuzzy probability of a fuzzy event). Let fuzzy event $A$, 
its $\alpha$-cut event $A_\alpha$ and the probability $P(A_\alpha)$ be defined from the above procedure. 
The fuzzy probability of $A$ is defined by 


$$P(A) = \{(P(A_\alpha), \alpha) \mid \alpha \in [0, 1]\}. \qquad (21.66)$$
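
For a finite sample space, both definitions can be evaluated directly. The following sketch uses a hypothetical four-element sample space with made-up probabilities and membership degrees. As a simplification, the fuzzy probability is listed only at the distinct positive membership levels, since $P(A_\alpha)$ is constant between consecutive levels.

```python
# Hypothetical probability distribution and fuzzy event on S = {a, b, c, d}.
P = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
mu = {"a": 1.0, "b": 0.5, "c": 0.5, "d": 0.0}


def crisp_probability(P, mu):
    """Crisp probability of a fuzzy event, (21.64): sum of mu(x) * P(x)."""
    return sum(mu[x] * P[x] for x in P)


def alpha_cut(mu, alpha):
    """alpha-cut set: all elements with membership degree at least alpha."""
    return {x for x, m in mu.items() if m >= alpha}


def fuzzy_probability(P, mu):
    """Pairs (P(A_alpha), alpha) of (21.66) at the distinct positive levels."""
    levels = sorted({m for m in mu.values() if m > 0})
    return [(sum(P[x] for x in alpha_cut(mu, a)), a) for a in levels]


p_crisp = crisp_probability(P, mu)
p_fuzzy = fuzzy_probability(P, mu)
print(p_crisp, p_fuzzy)
```

The crisp probability collapses the fuzzy event to a single number, while the fuzzy probability keeps one probability per membership level.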


Case-based reasoning 


Experience-based reasoning is a widely used reasoning paradigm based on logical 
arguments. It models human reasoning. Case-based reasoning is one method of 
experience-based reasoning. It relies on using encapsulated prior experiences as a 
basis for dealing with similar new situations, i.e., a case-based reasoner solves new 
problems by adapting solutions that were used to solve old problems. Case-based 
reasoning is a kind of similarity-based reasoning from a logical viewpoint. The 
system consists of a case base, in which each case comprises a previously encountered 
problem and its solution, and an experience-based reasoning mechanism. 

There are five different types of case-based reasoning systems, and although 
they share similar features, each of them is more appropriate for a particular type 
of problem [1]: exemplar-based, instance-based, memory-based, analogy-based, 
and typical case-based reasoning. Case-based reasoning systems are normally 
used in problems for which it is difficult to define rules. 

The traditional process models of case-based reasoning are the $R^4$ model [1] 
and the problem space model [14]. Both models basically describe the major 
process stages for performing case-based reasoning, i.e., case retrieval, case reuse 
and case adaptation. In addition, the $R^4$ model stresses the cyclic feature of 
case-based reasoning; the problem space model emphasizes the solution obtained based 
on two different types of similarity in the problem space and the solution space. 

Case retrieval is a process of finding and retrieving a case or a set of cases 
in the case base that is considered to be similar to the current problem. Case 
adaptation is a process in which the solutions of previous similar cases with 
successful outcomes are modified to suit the current case, bearing in mind the 
lessons from previous similar cases with unsuccessful solutions. Case updating is 
a process of revising cases or insertion of new cases in the case base. 

Fuzzy set-based models in case-based reasoning and a logical formalization of 
the basic case-based reasoning inference are proposed in [9]. A logical approach 
to case-based reasoning using fuzzy similarity relations is proposed in [33], based 
on the graded consequence relations named approximation entailment and prox- 
imity entailment. A unified logical foundation for the case-based reasoning cycle 
is based on an integration of traditional mathematical logic, fuzzy logic and 
similarity-based reasoning [12]. 


Granular computing and ontology 


Humans tend to think on granular, abstract levels rather than on the level of 
detailed and precise data. Too much information may cause an information over- 
load that reduces the quality of human decisions and actions [7]. Granular com- 
puting approximates detailed machine-like information by a coarser presenta- 
tion on a human-like level. Within granular computing, continuous variables 
are mapped into intervals for the formulation of linguistic variables. Important 
approaches to granular computing are fuzzy sets, interval regression [37] and 
granular box regression [32], rough sets [27, 28], soft sets [25], and shadowed sets 
[29]. 

The theory of rough sets [27, 28] was proposed by Pawlak in 1982 as a math- 
ematical tool for managing ambiguity, vagueness and general uncertainty that 
arise from granularity in the universe of discourse. It can be approached as an 
extension to the classical theory of sets. It is a framework for the construction 
of approximations of concepts when only incomplete information is available. 
The objects of the universe of discourse U, called rough sets, can be identified 
only within the limits determined by the knowledge represented by a given indis- 
cernibility relation, which defines a partition in U. A rough set is an imprecise 
representation of a concept (set) in terms of a pair of subsets, namely a lower 
approximation and an upper approximation. The approximations themselves can 
be crisp, imprecise or fuzzy. The lower approximation is the set of objects defi- 
nitely belonging to the vague concept, whereas the upper approximation is the 
set of objects possibly belonging to the vague concept. These approximations 
are used to define the notions of discernibility matrices, discernibility functions, 
reducts and dependency factors, all of which are necessary for reduction of 
knowledge [24]. Hybridizations for rule generation and exploiting the characteristics of 
rough sets with neural, fuzzy, and evolutionary approaches are available in the 
literature [24]. Fuzzy rough sets [39] are a generalization of rough sets to deal 
with both fuzziness and vagueness in data. 
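
The lower and upper approximations are straightforward to compute once the partition induced by the indiscernibility relation is known. The sketch below is illustrative: the function name `approximations`, the seven-object universe, its partition and the target concept are all hypothetical.

```python
def approximations(partition, concept):
    """Return the lower and upper approximations of a concept.

    partition: iterable of disjoint blocks (equivalence classes) covering U.
    concept:   the subset of U to be approximated.
    """
    concept = set(concept)
    lower, upper = set(), set()
    for block in partition:
        block = set(block)
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper


# Hypothetical universe {1, ..., 7} partitioned by an indiscernibility relation.
partition = [{1, 2}, {3, 4}, {5}, {6, 7}]
concept = {1, 2, 3, 5}

lower, upper = approximations(partition, concept)
boundary = upper - lower
print(lower, upper, boundary)
```

Objects in the lower approximation definitely belong to the concept, objects outside the upper approximation definitely do not, and the boundary region collects the objects that cannot be decided with the available knowledge.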

Fuzzy sets tend to capture vagueness exclusively through membership values. 
This poses a dilemma of excessive precision in describing imprecise phenomena. 
The notion of shadowed sets [29] tries to solve this problem of selecting the 
optimum level of resolution in precision. Shadowed sets can be viewed as a symbolic 
representation of numeric fuzzy sets [30]. Three quantification levels that are 
elements of the set {0, 1, [0, 1]} are utilized to simplify the relevant fuzzy sets 
in shadowed set theory. Conceptually, shadowed sets are close to rough sets. 
The concepts of negative region, lower bound and boundary region in rough 
set theory correspond to the three logical values 0, 1 and [0,1] in shadowed sets, 
namely excluded, included and uncertain, respectively. In this sense, shadowed 
sets bridge fuzzy and rough sets. 

Ontology is defined as that branch of metaphysics concerned with the “nature 
of being”. It typically has a more restricted definition, namely “a working model 
of entities and interactions.” Ontologies, as sets of concepts and their interrela- 
tions in a specific domain, are a useful tool in the areas of digital libraries, the 
semantic web, and personalized information management. Many different kinds 
of semantic or linguistic relations can be defined between terms or concepts, such 
as synonymy, hypernymy (is-a), and meronymy (part-of) relations. These relations 
and their representations are more formal than in a taxonomy, since ontologies 
are generally used to model complex knowledge about the real world and to infer 
additional knowledge. 


21.1 Describe the state of an environment by quantifying temperature as very 
cold, cold, cool, comfortable, warm, hot and very hot. Define an approximate 
universe of discourse. Represent state values using: (a) sets and (b) fuzzy sets. 


21.2 Show that tv(P or Q) = C(tv(P), tv(Q)) for any t-conorm C, where tv 
denotes truth value. 


21.3 Prove the relations given by (21.28) and (21.34). 
21.4 List the truth table for the logical operator $P \to Q$. 


21.5 Consider fuzzy sets A and B, defined by membership functions 


1 1 


TIF- 0 TIF GA 


(a) Calculate the union and intersection of A and B, and the complements of A 
and B. Plot their membership functions. 
(b) Plot the membership function of the product of A and B. 


Ma(x) 
21.6 Fuzzy set A describes “Temperature is higher than 35°C” by using 


1 
palz) i 2 <35 


and fuzzy set B describes “Temperature is approximately 38°C” by using 
1 


Lp(z) = 1+ (@—38) 


(a) Define a fuzzy set C: “Temperature is higher than 35 and approximately equal 
to 38”. 

(b) Draw the membership function for C, by using three t-norms for representing 
and. 

(c) Do the same when and is changed to or. 


21.7 Consider the following fuzzy sets defined on the finite universe of discourse 
X = {1,2,3,4,5}: 


A = {0,0.1, 0.3, 0.4, 1}, B = {0,0.1, 0.2, 0.3, 1}, C = {0.1, 0.2, 0.4, 0.6, 0}. 
(a) Verify the relations B C A and C C B. 
(b) Give the cardinalities of A, B, and C. 


21.8 Show that the Yager negation function $N(x) = (1 - x^w)^{1/w}$, $w \in (0, \infty)$, 
satisfies the definition for negation. 


21.9 Develop a rule-based model to approximate f(x) = x? + x + 4 in the inter- 
val x € [—1,3]. 


21.10 In a classical example given by Zadeh [46], consider the proposition “Hans ate $V$ 
eggs for breakfast,” where $V \in \{1, 2, \ldots\}$. A possibility distribution $\pi_V(x)$ and a 
probability distribution $p_V(x)$ are associated with $V$. 

(a) For each $x$, propose your distributions. 

(b) What is the difference between the two distributions? 


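21.11 The membership matrices of the relations $R_1$ on $X \times Y$ and $R_2$ on $Y \times Z$ 
are 

$$R_1 = \begin{bmatrix} 1.0 & 0.4 & 0.8 & 0.0 \\ 0.4 & 1.0 & 0.6 & 1.0 \\ 0.8 & 0.6 & 1.0 & 0.7 \\ 0.0 & 1.0 & 0.7 & 1.0 \end{bmatrix}, \quad R_2 = \begin{bmatrix} 1.0 & 0.9 & 0.8 \\ 1.0 & 0.1 & 0.6 \\ 0.5 & 0.4 & 0.0 \\ 0.1 & 0.3 & 0.2 \end{bmatrix}.$$


Calculate the max-min composition $R_1 \circ R_2$. 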


21.12 What type of fuzzy controller/model do you prefer, Mamdani-type or 
TSK-type? Justify your choice. Build some Mamdani fuzzy rules and TSK fuzzy 
rules of your own. 


21.13 Show that the following two characteristics of crisp set operators do not 
hold for fuzzy sets: 
21.14 A property unique to fuzzy sets (not for crisp sets) is $A \cap \bar{A} \neq \emptyset$. Justify 
this statement. 


21.15 Show that the two fuzzy sets satisfy De Morgan’s law: 


1 ae 
T+(a—8)) MPU TIF? 





palz) = 


21.16 Consider the definition of fuzzy implication. 

(a) Show that I(x,1) = 1, Vx € [0,1]. 

(b) Show that the restriction of $I$ to $\{0, 1\}^2$ coincides with the classical material 
implication. 


21.17 Show that the following two implication functions satisfy the definition 
of fuzzy implication. 

(a) The Lukasiewicz implication: I(x, y) = min(1,1— x + y). 

(b) The Zadeh implication: $I(x, y) = \max(1 - x, \min(x, y))$. 

(c) Plot these functions. 


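21.18 Show that the Gödel and Goguen (also known as product) implications, 
given by 


$$I_{\text{Gödel}}(x, y) = \begin{cases} 1, & \text{if } x \le y \\ y, & \text{otherwise} \end{cases}, \qquad I_{\text{Goguen}}(x, y) = \begin{cases} 1, & \text{if } x \le y \\ y/x, & \text{otherwise} \end{cases}$$


are R-implications derived from the t-norms minimum and product, respectively. 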


21.19 Using the definition of fuzzy composition of two fuzzy relations, verify 
that the properties 


$(R_1(x, y) \circ R_2(y, z))^{-1} = R_2^{-1}(z, y) \circ R_1^{-1}(y, x)$, 
$(R_1(X, Y) \circ R_2(Y, Z)) \circ R_3(Z, W) = R_1(X, Y) \circ (R_2(Y, Z) \circ R_3(Z, W))$ 

hold, as they do for crisp binary relations. 


21.20 Assume that 


03100902 19040802 
Ri(X,Y) = | 1.00.20.004], RaZ)=| 01040601 
0.5 0.8 0.6 1.0 oeri 


Compute $R_1 \circ R_2$. 


21.21 Consider a probability distribution $P(x)$, and two events $A$ 
and $B$: $P(a) = 0.2$, $P(b) = 0.4$, $P(c) = 0.3$, $P(d) = 0.1$; $A = \{a, b, c\}$; 
$B = \{(a, 0.4), (b, 0.8), (c, 0.9), (d, 0.2)\}$. 

(a) Find the probability of event A. 

(b) Find the crisp probability of fuzzy event B. 
(c) Find the fuzzy probability of fuzzy event B. 
Neurofuzzy systems 


22.1 Introduction 


The neurofuzzy system is inspired by the biological-cognitive synergism in human 
intelligence. It is the synergism between the neuronal transduction/processing 
of sensory signals, and the corresponding cognitive, perceptual and linguistic 
functions of the brain. 

As an example, we describe how the human is aware of the ambient tem- 
perature. The human skin has two kinds of temperature receptors to sense the 
temperature: one for warm and the other for cold [62]. Neural fibers from numer- 
ous temperature receptors enter the spinal cord to form synapses. These neural 
signals are passed forward to medial and ventrobasal thalamus in the lower part 
of the brain, and then further carried to the cerebral cortex. The process can be 
modelled by a neural network. On the cerebral cortex, the temperature outputs 
are fused, and expressed linguistically as cold, warm, or hot. This part can be 
modelled by a fuzzy system. Based on this knowledge, one can make a decision on 
whether to put on more clothes or turn off the air-conditioner. 

Hybridization of fuzzy logic and neural networks yields neurofuzzy systems, 
which capture the merits of both paradigms. Existing neurofuzzy systems are 
mainly customized for clustering, classification and regression. The learning 
capability of neural networks is exploited to adapt the knowledge base from given 
data, a task that is traditionally conducted by human experts. The 
application of fuzzy logic endows neural networks with the capability of explaining 
their actions. Neurofuzzy models usually achieve a faster convergence speed with 
a smaller network size, compared to neural networks. Interpretability and accu- 
racy are contradictory requirements: While interpretability is the capability to 
express the behavior of the real system in an understandable way, accuracy is 
the capability to represent the real system faithfully. A tradeoff between the two 
must be achieved. 

Both neural networks and fuzzy systems are dynamic, parallel distributed pro- 
cessing systems that estimate functions. Many neural networks and fuzzy systems 
are universal approximators. They estimate a function without any mathemati- 
cal model and learn from experience with sample data. From the point of view of 
an expert system, fuzzy systems and neural networks are quite similar as 
inference systems. An inference system involves knowledge representation, reasoning 
and knowledge acquisition: 


• A trained neural network represents knowledge using connection weights and 
neurons in a distributed manner. In a fuzzy system, knowledge is represented 
using IF-THEN rules. 

• When an input is presented to a neural network, an output is generated. This 
is a reasoning process. In a fuzzy system, reasoning is logic-based. 

• Knowledge acquisition is via learning in a neural network, while in a fuzzy 
system knowledge is encoded by a human expert. 


Fuzzy systems can be applied to problems with knowledge represented in the 
form of IF-THEN rules. Problem-specific a priori knowledge can be integrated 
into the systems. A training pattern set and system modeling are not needed, and 
only heuristics are used. During the tuning process, one needs to add, remove, or 
change a rule, or even change the weight of a rule, using knowledge of experts. On 
the other hand, neural networks are useful when we have a training pattern set. A 
trained neural network is a black box that represents knowledge in its distributed 
structure. However, prior knowledge of the problem cannot be incorporated 
into the learning process. It is difficult for human beings to understand the 
internal logic of the system. By extracting rules from neural networks, users can 
understand what neural networks have learned and how they predict. 


Interpretability 


A motivation for using fuzzy systems is their interpretability. 
Interpretability helps to check the plausibility of a system, leading to easy 
maintenance of the system. It can also be used to acquire knowledge from a problem 
characterized by numerical examples. An improvement in interpretability can 
enhance the performance of generalization when the data set is small. 

The interpretability of a rule base is usually related to continuity, consistency 
and completeness [26]. Continuity guarantees that small variations of the input 
do not induce large variations in the output. Consistency means that if two or 
more rules are simultaneously fired, their conclusions are coherent. Completeness 
means that for any possible input vector, at least one rule is fired and there is no 
inference breaking. Two neighboring fuzzy subsets in a fuzzy partition overlap. 

When neurofuzzy systems are used to model nonlinear functions described by 
training sets, the approximation accuracy can be optimized by the learning pro- 
cedure. However, since learning is accuracy oriented, it usually causes a reduction 
in the interpretability of the generated fuzzy system. The loss of
interpretability can be due to [41]: incompleteness of fuzzy partitions,
indistinguishability of fuzzy partitions (subsets), inconsistency of fuzzy rules,
too fuzzy or too crisp fuzzy subsets, and non-compactness of the fuzzy system.

To improve the interpretability of neurofuzzy systems, one can add
regularization terms to the cost function that apply constraints on the
parameters of fuzzy membership functions. For example, the order of the centers
of all the fuzzy subsets $A_i$, which are partitions of the fuzzy set $A$, should
be specified and remain unchanged during learning. Similar membership functions
should be merged to improve the distinguishability of fuzzy partitions and to
reduce the number of fuzzy subsets. One can also reduce the number of free
parameters in defining fuzzy subsets. To increase the interpretability of the
designed fuzzy system, the same linguistic term should be represented by the same
membership function. This results in weight sharing [61]. For the TSK model, one
practice for good interpretability is to keep the number of fuzzy subsets much
smaller than the number of fuzzy rules $N_r$, especially when $N_r$ is large.


Rule extraction from trained neural networks 


Fuzzy rules and multilayer perceptrons 


There are many techniques for extracting rules from trained neural networks 
[32, 11, 38, 7, 9, 47]. This leads to the functional equivalence between neural 
networks and fuzzy rule based systems. 

For a three-layer MLP with $\phi^{(1)}(\cdot)$ as the logistic function and
$\phi^{(2)}(\cdot)$ as the linear function, there always exists a fuzzy additive
system that calculates the same function as the network does [7]. In [7], a fuzzy
logic operator, called interactive-or (i-or), is defined by applying the concept
of f-duality to the logistic function. The use of the i-or operator explains
clearly the acquired knowledge of a trained MLP. The i-or operator is defined by
[7]

$$a \otimes b = \frac{a \cdot b}{(1-a)\cdot(1-b) + a \cdot b}. \qquad (22.1)$$

The i-or operator works on (0,1). It is a hybrid between both a t-norm and 
a t-conorm. Based on the i-or operator, the equality between MLPs and fuzzy 
inference systems has been established [7]. The equality proof also yields an 
automated procedure for knowledge acquisition. An extension of the method has 
been presented in [9]. 
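
As an illustration of (22.1), the following minimal Python sketch evaluates the
i-or operator and numerically checks the f-duality property on which it rests,
namely that the logistic function maps addition on the real line to i-or on
(0, 1). The function names `i_or` and `logistic` are chosen here for illustration
and are not taken from [7].

```python
import math


def i_or(a: float, b: float) -> float:
    """Interactive-or of (22.1); both arguments lie in the open interval (0, 1)."""
    return (a * b) / ((1.0 - a) * (1.0 - b) + a * b)


def logistic(x: float) -> float:
    """Logistic sigmoid, mapping the real line onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# f-duality check: logistic(x + y) should equal i_or(logistic(x), logistic(y)).
x, y = 0.7, -1.3
gap = abs(logistic(x + y) - i_or(logistic(x), logistic(y)))
print(gap < 1e-12)
```

When both arguments exceed 0.5 the result is larger than either argument, as for
a t-conorm; when both are below 0.5 it is smaller than either, as for a t-norm.
The value 0.5, which is the logistic image of zero, acts as the neutral element.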

In [22], relations between input uncertainties and fuzzy rules have been estab- 
lished. Sets of crisp logic rules applied to uncertain inputs have been shown to be 
equivalent to fuzzy rules with sigmoidal membership functions applied to crisp 
inputs. Crisp logic and fuzzy rule systems have been shown to be, respectively, 
equivalent to the logical network and the three-layer MLP. Keeping fuzziness on 
the input side enables easier understanding of the networks or the rule systems. 

In [11, 76], MLPs are interpreted by fuzzy rules in such a way that the sig- 
moidal activation function is decomposed into three TSK fuzzy rules with one 
TSK fuzzy rule for each partition. An algorithm for rule extraction given in 
[11] extracts O(N) rules for N examples. Rule generation from a trained neural
network can be done by analyzing the saturated zones of the fuzzy activation
functions [76].

A fuzzy set is usually represented by a finite number of its supports. In
comparison with conventional membership function based fuzzy inference systems,
$\alpha$-cut based fuzzy inference systems [82] have several advantages. First,
they can considerably reduce the required memory and time complexity, since they
depend on the number of membership-grade levels, and not on the number of
elements in the universes of discourse. Second, the inference operations can be
performed for each $\alpha$-cut set independently, and this enables parallel
implementation. Third, an $\alpha$-cut based fuzzy inference system can easily
interface with two-valued logic, since the $\alpha$-level sets themselves are
crisp sets. In addition, fuzzy set operations based on the extension principle
can be performed efficiently using $\alpha$-level sets [82]. For $\alpha$-cut
based fuzzy inference systems, fuzzy rules can be learned by an MLP with the BP
rule.
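
The idea of storing a fuzzy set by its level sets can be sketched in a few lines
of Python. The fuzzy set `about_5`, its membership grades, and the three chosen
levels are hypothetical values used only for illustration; the sketch is not the
inference scheme of [82].

```python
def alpha_cut(fuzzy_set: dict, alpha: float) -> set:
    """Crisp set of elements whose membership grade is at least alpha."""
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}


def reconstruct(cuts: dict, universe) -> dict:
    """Recover each grade as the largest level whose cut contains the element."""
    return {
        x: max((a for a, c in cuts.items() if x in c), default=0.0)
        for x in universe
    }


# A discrete fuzzy set on a small universe of discourse (hypothetical grades).
about_5 = {3: 0.2, 4: 0.6, 5: 1.0, 6: 0.6, 7: 0.2}
levels = [0.2, 0.6, 1.0]

# One crisp set per membership-grade level; each can be processed independently.
cuts = {alpha: alpha_cut(about_5, alpha) for alpha in levels}
print(cuts)
print(reconstruct(cuts, about_5) == about_5)
```

Because every level set is an ordinary crisp set, each one can be handled by
two-valued logic, and the storage grows with the number of levels rather than
with the size of the universe of discourse.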


Fuzzy rules and RBF networks 


The normalized RBF network is found functionally equivalent to a class of TSK 
systems [35]. For the convenience of presentation, we reproduce the output of 
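the $J_1$-$J_2$-$J_3$ normalized RBF network given by (10.60):

$$y_j = \frac{\sum_{i=1}^{J_2} w_{ij}\, \phi\left(\|\mathbf{x} - \mathbf{c}_i\|\right)}{\sum_{i=1}^{J_2} \phi\left(\|\mathbf{x} - \mathbf{c}_i\|\right)}, \quad j = 1, \ldots, J_3. \qquad (22.2)$$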

When the t-norm in the TSK model is selected as algebraic product and the 
membership functions are selected as RBFs, the two models are mathematically 
equivalent [35]. Note that each hidden unit corresponds to a fuzzy rule. The 
normalized RBF network provides a localized solution that is amenable to rule 
extraction. The receptive fields of some RBFs should overlap to prevent incom- 
pleteness of fuzzy partitions. 
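
The correspondence between hidden units and rules can be made concrete with a
minimal Python sketch of the normalized RBF output in (22.2). A Gaussian basis
function with a common width `sigma` is assumed here, and the centers and weights
are hypothetical values. Each hidden unit i behaves as a zero-order TSK rule
"IF x is near c_i THEN y_j = w_ij", and the normalized activations play the role
of normalized firing strengths.

```python
import math


def gaussian(r: float, sigma: float = 1.0) -> float:
    """Gaussian basis function of the distance r (an assumed choice of phi)."""
    return math.exp(-r * r / (2.0 * sigma * sigma))


def normalized_rbf(x, centers, weights, sigma=1.0):
    """Evaluate (22.2); weights[i][j] is w_ij for hidden unit i and output j."""
    phi = [gaussian(math.dist(x, c), sigma) for c in centers]
    total = sum(phi)
    n_out = len(weights[0])
    return [
        sum(phi[i] * weights[i][j] for i in range(len(centers))) / total
        for j in range(n_out)
    ]


# Two rules: "near (0, 0) -> 1" and "near (2, 0) -> 3" (hypothetical values).
centers = [(0.0, 0.0), (2.0, 0.0)]
weights = [[1.0], [3.0]]
y_mid = normalized_rbf((1.0, 0.0), centers, weights)
print(y_mid)
```

At a point equidistant from both centers the two rules fire equally and the
output is the average of the two consequents; in general the output is a convex
combination of the rule consequents.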

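To have a perfect match between $\phi(\|\mathbf{x} - \mathbf{c}_i\|)$ in (22.2) and $\mu_{A_i}(\mathbf{x})$ in (21.61),
one is required to select a factorizable $\phi(\|\mathbf{x} - \mathbf{c}_i\|)$ such that

$$\mu_{A_i}(\mathbf{x}) = \prod_{j=1}^{J_1} \mu_{A_i^j}(x_j) = \phi\left(\|\mathbf{x} - \mathbf{c}_i\|\right) = \prod_{j=1}^{J_1} \phi\left(|x_j - c_{i,j}|\right). \qquad (22.3)$$

Each component $\phi(|x_j - c_{i,j}|)$ corresponds to a membership function $\mu_{A_i^j}$. Note
that the Gaussian RBF is the only strictly factorizable function.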
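
The factorization in (22.3) can be checked numerically for the Gaussian RBF. The
sketch below assumes a common width `sigma` for all input dimensions; the test
point, the center, and the width are arbitrary illustrative values.

```python
import math


def gaussian_rbf(x, c, sigma):
    """Multivariate Gaussian RBF evaluated on the Euclidean distance to c."""
    r2 = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return math.exp(-r2 / (2.0 * sigma ** 2))


def product_of_memberships(x, c, sigma):
    """Product t-norm of one-dimensional Gaussian membership functions."""
    prod = 1.0
    for xj, cj in zip(x, c):
        prod *= math.exp(-((xj - cj) ** 2) / (2.0 * sigma ** 2))
    return prod


x = (0.3, -1.2, 2.0)
c = (0.0, 0.5, 1.5)
sigma = 0.8
gap = abs(gaussian_rbf(x, c, sigma) - product_of_memberships(x, c, sigma))
print(gap < 1e-12)
```

The equality holds because the exponential turns the sum of squared coordinate
differences into a product of one-dimensional factors.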

In the normalized RBF network, the weights $w_{ij}$ typically take constant values and the
normalized RBF network corresponds to the zero-order TSK model. When the 
RBF weights are linear regression functions of the input variables, the model is 
functionally equivalent to the first-order TSK model. 

In a practical implementation of the TSK model, one can select some $\mu_{A_i^j} = 1$
or some $\mu_{A_i^j} = \mu_{A_k^j}$ in order to increase the distinguishability of the fuzzy
partitions. Correspondingly, one should share some component RBFs or set some
component RBFs to unity. This considerably reduces the effective number of free
parameters in the RBF network. When implementing component RBF or membership
function sharing, a Euclidean-like distance measure is used to describe the
similarity between two component RBFs. A gradient-descent procedure is
conducted so as to extract interpretable fuzzy rules from a trained RBF network
[41].

A fuzzy system can be first constructed according to heuristic knowledge and 
existing data, and then converted into an RBF network. This is followed by a 
refinement of the RBF network using a learning algorithm. Due to this learning 
procedure, the interpretability of the original fuzzy system may be lost. The RBF 
network is then converted back into an interpretable fuzzy system, and knowledge
is extracted from the network. This process refines the original fuzzy system 
design. 

The fuzzy basis function network [86] has a structure similar to that of the 
RBF network. It is also based on the TSK model. It is capable of uniformly 
approximating any continuous nonlinear function to a specified accuracy with a 
finite number of basis functions. It can readily adopt various learning algorithms 
developed for the RBF network. An incremental learning algorithm for the fuzzy 
basis function network is introduced in [57]. In the sequential adaptive fuzzy
inference system [71], fuzzy rules are added or removed using the concept of the
influence of a fuzzy rule, based on the input data received so far, in a way
similar to that used by GAP-RBFN for hidden neurons.


Rule extraction from SVMs 


Rules can be extracted from trained SVMs. By using support vectors from a 
trained SVM, it is possible to use any RBF network learning technique for rule 
extraction, while avoiding the overlapping problem between classes [66]. By
merging node centers and support vectors, explanation rules can be obtained in
the form of ellipsoids and hyper-rectangles.

In the decompositional approach, regions are formed in the input space, utilizing
the SVM decision functions and the support vectors, and these regions are then
mapped to rules. Three types of regions are formed: ellipsoids [65],
hyperrectangles [65], and hypercubes [23].

The SVM+prototype method [65] utilizes a clustering algorithm to determine
prototype vectors for each class, which are then used together with the support
vectors to define ellipsoid and hyperrectangle regions in the input space.
Ellipsoids are then mapped to IF-THEN rules. This iterative procedure first
trains an SVM model, which divides the training data into two subsets: those with
positive predicted class and those with negative predicted class. For each of
these subsets, clusters are generated. Based on the cluster prototype and the
farthest support vector, interval or ellipsoid rules can be created. The rules
extracted by this method are of high accuracy and fidelity; however, it produces
a relatively large number of rules.

Rule extraction from linear SVMs or from any of the hyperplane-based linear
classifiers is approached based on an LP formulation of SVMs with linear kernels 
[23]. Each rule extracted defines a hypercube, which is a subset of one of the 
bounded regions and must have one vertex that lies on the separating hyperplane 
for the rules to be disjoint. The method is decompositional as it is only applicable 
when the underlying model provides a linear decision boundary. 

The decompositional rule extraction technique SQRex-SVM [5] extracts rules 
directly from the support vectors of a trained SVM using a modified sequential 
covering algorithm. Rules are generated based on an ordered search of the most 
discriminative features, as measured by interclass separation. After training the 
SVM model, only the support vectors that are correctly classified are used in the 
next rule generation steps. 

An active learning based approach given in [58] extracts rules from the trained 
SVM model by explicitly making use of key concepts of SVM. By focusing on the 
input regions close to the decision boundary, better discrimination power can be 
obtained. 

In [94], parsimonious L2-SVM based fuzzy classifiers are constructed, with
model selection and feature ranking performed simultaneously; the fuzzy rules
are generated from data by L2-SVM learning. As a prototype-based classifier, the
L2-SVM fuzzy classifier has as many support vectors as induced fuzzy rules.

An exact representation of SVMs as TSK fuzzy systems is given in [10] for any
kernel function used. The behavior of SVMs is explained by means of fuzzy
logic, and the interpretability of the system is improved by introducing the
$\lambda$-fuzzy rule-based system ($\lambda$-FRBS). The $\lambda$-FRBS exactly
reproduces the SVM’s decision boundary, and its rules and membership functions
are very simple, aggregating the antecedents with uninorms as compensation
operators. The rules of the $\lambda$-FRBS are limited to two, and the number of
fuzzy propositions in each rule depends only on the cardinality of the set of
support vectors. For that reason, the $\lambda$-FRBS overcomes the curse of
dimensionality.


Rule generation from other neural networks 


Rule generation encompasses both rule extraction and rule refinement. Rule 
extraction is to extract knowledge from trained neural networks using the net- 
work parameters, while rule refinement is to refine the rules that are extracted 
from neural networks and initialized with crude domain knowledge. 

Rule extraction from trained neural networks can be categorized into three 
main families: learning-based, decompositional and eclectic. Learning-based 
approaches treat the model as a black-box describing only the relationship 
between the inputs and the outputs. Decompositional approaches open the 
model, look into its individual components, and then attempt to extract rules 
at the level of these components. The eclectic approach lies in between the two 
families.

Feedforward networks generally do not have the capability to represent
recursive rules when the depth of the recursion is not known a priori. Recurrent
networks have the ability to store information over indefinite periods of time, to
develop hidden states through learning, and thus to conveniently represent
recursive linguistic rules [59]. They are particularly well suited for problem
domains where incomplete or contradictory prior knowledge is available. In such
cases, knowledge revision or refinement is also possible.

Discrete-time recurrent networks have been used to correctly classify strings of 
a regular language [67]. Recurrent networks are suitable for crisp/fuzzy
grammatical inference. For rule extraction from recurrent networks, the recurrent
network is transformed into an equivalent deterministic finite-state automaton by
applying clustering algorithms in the output space of neurons. An augmented
recurrent network that encodes fuzzy finite-state automata and recognizes a given
fuzzy regular language with an arbitrary accuracy has been constructed in [68].
The granularity within both extraction techniques is at the level of ensembles of
neurons, and thus, the approaches are not strictly decompositional. Rule extraction
from recurrent networks aims to find models of a recurrent network, typically 
in the form of finite state machines. This is carried out using four steps [34]: 
quantization of the continuous state space of the recurrent network, resulting 
in a discrete set of states; state and output generation by feeding the recurrent 
network with input patterns; construction of the corresponding deterministic 
finite-state automaton, based on the observed transitions; and minimization of 
the deterministic finite-state automaton. 
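
The first three of these steps can be sketched on a toy example. The one-unit
update `rnn_step` below is a hypothetical stand-in for a trained recurrent
network (it flips its state on the symbol '1'), and a simple threshold replaces
the clustering-based quantization; neither is taken from [34]. Minimization is
omitted because the automaton obtained here is already minimal.

```python
import math


def rnn_step(h: float, symbol: str) -> float:
    """Toy one-unit recurrent update (hypothetical): flips the state on '1'."""
    s = 1.0 - h if symbol == "1" else h
    return 1.0 / (1.0 + math.exp(-10.0 * (s - 0.5)))


def quantize(h: float) -> int:
    """Step 1: map the continuous state to a discrete state by thresholding."""
    return 1 if h > 0.5 else 0


def extract_transitions(strings, h0=0.0):
    """Steps 2 and 3: drive the network and record the observed transitions."""
    transitions = {}
    for string in strings:
        h = h0
        for symbol in string:
            h_next = rnn_step(h, symbol)
            # A conflicting observation would overwrite the earlier entry here.
            transitions[(quantize(h), symbol)] = quantize(h_next)
            h = h_next
    return transitions


def dfa_accepts(table, string, start=0, accepting=(1,)):
    """Run the extracted automaton on a string."""
    state = start
    for symbol in string:
        state = table[(state, symbol)]
    return state in accepting


table = extract_transitions(["0110", "1011"])
print(table)
print(dfa_accepts(table, "10110"))
```

The extracted table is the two-state parity automaton: the discrete state changes
on '1' and is preserved on '0', so a string is accepted when it contains an odd
number of ones.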

In the all-permutations fuzzy rule base method [47], the input-output map- 
ping of a specific fuzzy rule base is a linear sum of sigmoidal functions. This 
Mamdani-type fuzzy model is shown to be mathematically equivalent to a
standard feedforward network. It was used to extract and insert symbolic
information into feedforward networks [47]. The method is also used to extract symbolic
knowledge from recurrent networks [48]. 

Rule extraction has also been carried out on Kohonen networks [83]. A
comprehensive survey on rule generation from trained neural networks has been
provided in [59], where the optimization capability of evolutionary algorithms is
emphasized for rule refinement. An overview of rule extraction from recurrent
networks is given in [34]. 


Extracting rules from numerical data 


Fuzzy inference systems can be designed directly from expert knowledge and 
data. The design process is usually decomposed into two phases, namely rule 
generation and system optimization [26]. Rule generation leads to a basic system 
with a given space partitioning and the corresponding set of rules, while system 
optimization can be the optimization of membership parameters and rule base. 
Design of fuzzy rules can be conducted in one of three ways, namely, all possible
combinations of fuzzy partitions, one rule for each data pair, or dynamically
choosing the number of fuzzy sets.

Figure 22.1 Partitioning of the two-dimensional input space. (a) Grid partitioning. (b) k-d tree
partitioning. (c) Multilevel grid partitioning. (d) Scatter partitioning.

For good interpretability, a suitable selection of variables and the reduction 
of the rule base are necessary. During the system optimization phase, merging 
techniques such as cluster merging and fuzzy-set merging are usually used for 
interpretability purposes. Fuzzy-set merging leads to higher interpretability than 
cluster merging. The reduction of a set of rules results in a loss of numerical 
performance on the training data set, but a more compact rule base has a better 
generalization capability and is also easier for human understanding. 

Methods for designing fuzzy inference systems from data are analyzed and 
surveyed in [26], with emphasis on clustering methods for rule generation and 
evolutionary algorithms for system optimization. They are grouped into several
families and compared based on rule interpretability. 


Rule generation based on fuzzy partitioning 


For rule generation, fuzzy partitioning is used for structure identification for 
fuzzy inference systems, and a learning algorithm is then used for parameter 
identification. There are usually three methods for partitioning the input space, 
namely grid partitioning, tree partitioning and scatter partitioning. These parti- 
tioning methods in the two-dimensional input space are illustrated in Fig. 22.1. 


The grid structure has easy interpretability and is most widely used for
generating fuzzy rules. Fuzzy sets of each variable are shared by all the rules.
However, the number of fuzzy rules grows exponentially with the input dimension;
this is the curse of dimensionality. For $n$ input variables, each being
partitioned into $m_i$ fuzzy sets, a total of $\prod_{i=1}^{n} m_i$ rules are
needed to cover the whole input space. Since each rule has a few parameters to
adjust, there are too many parameters to adapt during the learning process. This
reduces the interpretability of the fuzzy system. The grid structure is a static
structure, and is appropriate for a low-dimensional data set with good coverage.
The performance of the resultant model depends entirely on the initial definition
of these grids. Thus, a training procedure can be applied to optimize the grid
structure and the rule consequences. The grid structure is illustrated in
Fig. 22.1a.

k-d tree and multilevel grid structures are two hierarchical partitioning
techniques [78]. The input space is first partitioned roughly, and a subspace is
recursively divided until a desired approximation performance is achieved. The
k-d tree results from a series of guillotine cuts, each of which runs entirely
across the subspace to be partitioned. After the $i$th guillotine cut, the entire
space is partitioned into $i + 1$ regions. A k-d tree partitioning is illustrated
in Fig. 22.1b. For the multilevel grid structure, the top-level grid coarsely
partitions the whole space into
equal-sized and evenly-spaced fuzzy boxes, which are recursively partitioned into 
finer grids until a criterion is met. Hence, a multilevel grid structure is also called 
a box tree. The criterion can be that the resulting boxes have a similar number of 
training examples or that an application-specific evaluation in each grid is below 
a threshold. A multilevel grid partitioning is illustrated in Fig. 22.1c. A multi- 
level grid in the two-dimensional space is called a quad tree. Tree partitioning 
significantly relieves the problem of rule explosion, but it needs some heuristics 
to extract rules. 

Scatter partitioning uses multidimensional antecedent fuzzy sets. It usually 
generates fewer fuzzy regions than the grid and tree partitioning techniques 
owing to the natural clustering property of training patterns. Fuzzy clustering 
algorithms form a family of rule-generation techniques. The training examples 
are gathered into homogeneous groups and a rule is associated to each group. 
The fuzzy sets are not shared by the rules, but each of them is tailored for one 
particular rule. Thus, the resulting fuzzy sets are usually difficult to interpret 
[26]. Scatter partitioning of high-dimensional feature spaces is difficult, and some 
learning or evolutionary procedures may be necessary. A scatter partitioning is 
illustrated in Fig. 22.1d. Some clustering-based methods for extracting fuzzy
rules for function approximation have been proposed based on the TSK model.
Clustering can be used for identification of the antecedent part of the model such 
as determination of the number of rules and initial rule parameters. Each cluster 
center corresponds to a fuzzy rule. The consequent part of the model can be 
estimated by the LS method. Based on the Mamdani model, a clustering-based 
method for function approximation is also given in [88]. 


Other methods 


Hierarchical structure for fuzzy-rule systems can also effectively solve the rule- 
explosion problem [87, 54]. A hierarchical fuzzy system is comprised of a number 
of low-dimensional fuzzy systems connected in a hierarchical fashion. The low- 
dimensional fuzzy systems can be TSK systems, each constituting a level in the 
hierarchical fuzzy system. The total number of rules increases only linearly with 
the number of input variables. For a hierarchical fuzzy system comprised of $n - 1$
two-input TSK systems, the $n$ input variables are $x_i$, $i = 1, \ldots, n$, the output is
denoted by $y$, and $y_i$ is the output of the $i$th TSK system:

$$y_i = f_i\left(y_{i-1}, x_{i+1}\right), \quad i = 1, \ldots, n-1, \qquad (22.4)$$

where $f_i$ is the nonlinear relation described by the $i$th TSK system and
$y_0 = x_1$. The final output $y = y_{n-1}$ is easily obtained by a recursive procedure.
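
The recursion in (22.4) amounts to a left fold over the input variables. In the
following minimal sketch each two-input TSK system is represented by an arbitrary
Python callable; the averaging function used in the example is a hypothetical
stand-in, not a fuzzy system.

```python
def hierarchical_output(x, subsystems):
    """Evaluate (22.4): y_0 = x_1 and y_i = f_i(y_{i-1}, x_{i+1})."""
    if len(subsystems) != len(x) - 1:
        raise ValueError("n inputs require n - 1 two-input subsystems")
    y = x[0]
    for f_i, x_next in zip(subsystems, x[1:]):
        y = f_i(y, x_next)
    return y


def average(a, b):
    """Hypothetical stand-in for a two-input TSK system."""
    return 0.5 * (a + b)


y = hierarchical_output([1.0, 3.0, 5.0], [average, average])
print(y)
```

Each subsystem sees only two variables, the output of the previous level and one
new input, which is why the rule count of every level stays small.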

Hierarchical TSK systems [87] and generalized hierarchical TSK systems [54] 
have been shown to be universal approximators of any continuous function 
defined on a compact set. If there are $n$ variables, each of which is partitioned
into $m_i$ fuzzy subsets, the total number of rules is only $\sum_{i=1}^{n-1} m_i m_{i+1}$. However,
the curse of dimensionality is inherent in the system. In a standard fuzzy system,
the degree of freedom is unevenly distributed over the IF and THEN parts of
the rules, with a comprehensive IF part to cover the whole domain and a simple
THEN part. The hierarchical fuzzy system, on the other hand, provides
an incomplete IF part but a more complex THEN part. Generally, conventional
fuzzy systems achieve universal approximation using piecewise-linear functions, 
while the hierarchical fuzzy system achieves it through piecewise-polynomial 
functions [87, 54]. 
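
The difference between the two rule counts is easy to quantify. The sketch below
compares the grid-partitioning count, the product of all partition sizes, with
the hierarchical count as stated above, the sum of products of neighboring
partition sizes. The choice of six variables with three fuzzy subsets each is an
arbitrary example.

```python
from math import prod


def grid_rule_count(m):
    """Rules needed by a full grid partition: product of all partition sizes."""
    return prod(m)


def hierarchical_rule_count(m):
    """Rules of a chain of two-input systems: sum of m_i * m_{i+1}."""
    return sum(m[i] * m[i + 1] for i in range(len(m) - 1))


m = [3] * 6  # six input variables, three fuzzy subsets each (arbitrary example)
print(grid_rule_count(m), hierarchical_rule_count(m))
```

For equal partition sizes the hierarchical count grows linearly with the number
of inputs, whereas the grid count grows exponentially.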

Designing fuzzy systems from pattern pairs is a nonlinear regression problem. 
In a simple look-up-table technique [85], each pattern pair generates one fuzzy 
rule and then a selection process determines the important rules, which are used 
to construct the final fuzzy system. The input membership functions do not 
change with the sampling data, thus the designed fuzzy system uniformly covers 
the domain of interest. The input and output spaces are first divided into fuzzy 
regions, then a fuzzy rule is generated from a given pattern pair, and finally a 
degree is assigned to each rule to resolve rule conflicts and reduce the number 
of rules. When a new pattern pair becomes available, a rule is created for this 
pattern pair and the fuzzy rule base is updated. The look-up-table technique is 
implemented in five steps in [85, 88]. The fuzzy system thus constructed is proved 
to be a universal approximator by using the Stone-Weierstrass theorem [85]. The 
approach is a simple and fast one-pass procedure. However, the algorithm produces
an enormous number of rules. The problem of contradictory rules also arises, and
noisy data in the training examples will affect the consequence of a rule. In a
similar grid partitioning based method, each datum generates one rule [1]. 
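
The rule-generation part of the look-up-table technique can be sketched as
follows. The sketch assumes evenly spaced triangular membership functions and
takes the degree of a rule to be the product of the membership grades of its
antecedents and consequent, which is one common choice; the function names and
the three pattern pairs are hypothetical. Defuzzification, the final step of the
procedure, is omitted.

```python
def triangular_partition(lo, hi, k):
    """Centers and half-width of k evenly spaced triangular regions on [lo, hi]."""
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)], step


def memberships(value, centers, width):
    """Triangular membership grades of a value in every fuzzy region."""
    return [max(0.0, 1.0 - abs(value - c) / width) for c in centers]


def best_region(value, centers, width):
    """Index and grade of the fuzzy region with the largest membership."""
    grades = memberships(value, centers, width)
    idx = max(range(len(grades)), key=grades.__getitem__)
    return idx, grades[idx]


def build_rule_base(pairs, in_parts, out_part):
    """One rule per pattern pair; conflicts keep the rule of highest degree."""
    rules = {}  # antecedent region indices -> (consequent region index, degree)
    for x, y in pairs:
        antecedent, degree = [], 1.0
        for xi, (centers, width) in zip(x, in_parts):
            idx, grade = best_region(xi, centers, width)
            antecedent.append(idx)
            degree *= grade
        out_idx, out_grade = best_region(y, *out_part)
        degree *= out_grade
        key = tuple(antecedent)
        if key not in rules or degree > rules[key][1]:
            rules[key] = (out_idx, degree)
    return rules


in_parts = [triangular_partition(0.0, 1.0, 3)]
out_part = triangular_partition(0.0, 1.0, 3)
pairs = [((0.1,), 0.1), ((0.2,), 0.6), ((0.9,), 1.0)]
rules = build_rule_base(pairs, in_parts, out_part)
print(rules)
```

The first two pattern pairs fall into the same input region but suggest different
output regions; the rule with the larger degree is kept, which illustrates how
the degree resolves contradictory rules.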

Many other general methods can be used to automatically extract fuzzy rules 
from a set of numerical examples and to build a fuzzy system for function 
approximation, such as heuristics-based approaches [80] and hybrid neural-fuzzy 
approaches [36]. A function approximation problem can be first converted into 
a pattern-classification problem, and then solved by using a fuzzy system [80]. 
The universe of discourse of the output variable is divided into multiple inter- 
vals, each regarded as a class, and then a class is assigned to each of the training 
data according to the desired value of the output variable. The data of each 
class are then partitioned in the input space to achieve a higher accuracy in the 
approximation of the class regions until a termination criterion is satisfied.


Synergy of fuzzy logic and neural networks


While neural networks have strong learning capabilities at the numerical level, 
it is difficult for the users to understand them at the logic level. Fuzzy logic, on 
the other hand, offers good interpretability and can also integrate
experts’ knowledge. Synergy of both paradigms yields the capabilities of learning,
good interpretation, and incorporating prior knowledge.

The combination can be in different forms. The simplest form may be the 
concurrent neurofuzzy model, where a fuzzy system and a neural network work 
separately. The output of one system can be fed as the input to the other system. 
The cooperative neurofuzzy model corresponds to the case, where one system is 
used to adapt the parameters of the other system. Neural networks can be used 
to learn the membership values for fuzzy systems, to construct IF-THEN rules 
[25], or to construct a decision logic. 

The true synergy of the two paradigms is a hybrid neural/fuzzy system, which 
captures the merits of both the systems. It can be in the form of either a fuzzy 
neural network or a neurofuzzy system. A hybrid neural-fuzzy system does not
use multiplication, addition, or the sigmoidal function. Instead, fuzzy logic
operations such as t-norm and t-conorm are used.

A fuzzy neural network is a neural network equipped with the capability of han- 
dling fuzzy information, where the input signals, activation functions, weights, 
and/or the operators are based on fuzzy set theory. Thus, symbolic structure is 
incorporated [69]. The network can be represented in an equivalent rule-based 
format, where the premise is the concatenation of fuzzy AND and OR logic, and 
the consequence is the network output. The fuzzy AND and OR neurons are 
defined by 


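$$y_{\mathrm{AND}} = \wedge\left(\vee(w_1, x_1), \vee(w_2, x_2)\right) = T\left(C(w_1, x_1), C(w_2, x_2)\right), \qquad (22.5)$$

$$y_{\mathrm{OR}} = \vee\left(\wedge(w_1, x_1), \wedge(w_2, x_2)\right) = C\left(T(w_1, x_1), T(w_2, x_2)\right). \qquad (22.6)$$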


Weights always have values in [0, 1], and negative weight is achieved by using the 
NOT operator. The weights of the fuzzy neural network can be interpreted as 
calibration factors of the conditions and rules. In [70], fuzzy logic networks for 
logic-based data analysis are treated. The networks are homogeneous
architectures comprising OR/AND neurons. The developed network realizes a logic
approximation of multidimensional mappings between unit hypercubes, that is,
transformations from $[0, 1]^n$ to $[0, 1]^m$.
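
The fuzzy AND and OR neurons of (22.5) and (22.6) can be sketched in Python as
follows. Here T and C are read as a t-norm and a t-conorm; the minimum and
maximum are used as defaults, and any other pair can be passed in. The extension
from two inputs to an arbitrary number of inputs is an assumption of this sketch.

```python
def and_neuron(w, x, t_norm=min, t_conorm=max):
    """(22.5): combine each weight and input by C, then aggregate by T."""
    y = 1.0  # neutral element of any t-norm
    for wi, xi in zip(w, x):
        y = t_norm(y, t_conorm(wi, xi))
    return y


def or_neuron(w, x, t_norm=min, t_conorm=max):
    """(22.6): combine each weight and input by T, then aggregate by C."""
    y = 0.0  # neutral element of any t-conorm
    for wi, xi in zip(w, x):
        y = t_conorm(y, t_norm(wi, xi))
    return y


x = [0.3, 0.8]
print(and_neuron([0.0, 0.0], x))  # plain AND of the inputs
print(and_neuron([1.0, 0.0], x))  # a weight of 1 masks the first input
print(or_neuron([1.0, 1.0], x))   # plain OR of the inputs
print(or_neuron([0.0, 1.0], x))   # a weight of 0 masks the first input
```

The weights thus act as calibration factors: in the AND neuron a weight close to
1 removes the influence of its input, while in the OR neuron a weight close to 0
does so.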

A neurofuzzy system is a fuzzy system whose parameters are learned by a 
learning algorithm obtained from neural networks. It can always be interpreted as 
a system of fuzzy rules. Learning is used to adaptively adjust the rules in the rule 
base, and to produce or optimize the membership functions of a fuzzy system. 
A neurofuzzy system has a neural-network architecture constructed from fuzzy 
reasoning. Structured knowledge is codified as fuzzy rules, while the adapting
and learning capabilities of neural networks are retained. Expert knowledge can
increase learning speed and estimation accuracy.

Both fuzzy neural networks and neurofuzzy systems can be treated as neural 
networks, where the units employ t-norm or t-conorm operator instead of an 
activation function. The weights are fuzzy sets, and the neurons apply t-norm 
or t-conorm operations. The hidden layers are usually used as rule layers. The 
layers before the rule layers perform as premise layers, while those after perform 
as consequent layers. As there is no distinct borderline between a neurofuzzy
system and a fuzzy neural network, we refer to both types of synergism as
neurofuzzy systems. When only the input is fuzzy, it is a type-I neurofuzzy
system. When everything except the input is fuzzy, we get a type-II model. A
type-III model is defined as one where the inputs, weights, and shift terms are
all fuzzy.

The functions realizing the inference process are usually nondifferentiable and 
thus, the popular gradient-descent or BP algorithm cannot always be applied for 
training neurofuzzy systems. To make use of gradient-based algorithms, one has
to select differentiable functions. For nondifferentiable inference functions, training
can be performed by using evolutionary algorithms. The shape of the membership 
functions, the number of fuzzy partitions, and the rule base can all be evolved 
by using evolutionary algorithms. Roughly speaking, the neurofuzzy method is 
superior to the neural-network method in terms of the convergence speed and 
compactness of the structure. 


ANFIS model 


The ANFIS is a well-known neurofuzzy model [35, 36, 38]. The ANFIS model,
shown in Fig. 22.2, has a six-layer ($n$-$nK$-$K$-$K$-$K$-1) architecture, and is a
graphical representation of the TSK model. The symbol N in the circles denotes the
normalization operator, and $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$.

Layer 1 is the input layer with $n$ nodes. Layer 2 has $nK$ nodes, each outputting
the membership value of the $i$th antecedent of the $j$th rule

$$o_{ij}^{(2)} = \mu_{A_j^i}(x_i), \quad i = 1, \ldots, n, \; j = 1, \ldots, K, \qquad (22.7)$$

where $A_j^i$ defines a partition of the space of $x_i$, and $\mu_{A_j^i}(x_i)$ is typically
selected as a generalized bell membership function defined by (21.20), $\mu_{A_j^i}(x_i) =
\mu(x_i; c_j^i, a_j^i, b_j^i)$. The parameters $c_j^i$, $a_j^i$, and $b_j^i$ are referred to as premise
parameters.

Layer 3 has K fuzzy neurons with the product t-norm as the aggregation 
operator. Each node corresponds to a rule, and the output of the jth neuron 
determines the degree of fulfillment of the jth rule 


$$o_j^{(3)} = \prod_{i=1}^{n} \mu_{A_j^i}(x_i), \quad j = 1, \ldots, K. \qquad (22.8)$$

Figure 22.2 ANFIS: graphical representation of the TSK model.


Each neuron in layer 4 performs normalization, and the outputs are called 
normalized firing strengths 

$$o_j^{(4)} = \frac{o_j^{(3)}}{\sum_{k=1}^{K} o_k^{(3)}}, \quad j = 1, \ldots, K. \qquad (22.9)$$

The output of each node in layer 5 is defined by 


$$o_j^{(5)} = o_j^{(4)} f_j(\mathbf{x}), \quad j = 1, \ldots, K, \qquad (22.10)$$

where $f_j(\cdot)$ is given for the $j$th node in layer 5. Parameters in $f_j(\mathbf{x})$ are referred
to as consequent parameters.

The outputs of layer 5 are summed and the output of the network, $o^{(6)} =
\sum_{j=1}^{K} o_j^{(5)}$, gives the TSK model (21.61).
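
The layer-by-layer computation can be summarized in a short forward-pass sketch.
It assumes the standard generalized bell function $1/(1 + |(x - c)/a|^{2b})$,
which is taken here to be the form referred to as (21.20), and first-order
consequents with the bias stored first; the data layout and the numerical values
are hypothetical.

```python
import math


def gbell(x, c, a, b):
    """Generalized bell membership function 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))


def anfis_forward(x, premise, consequent):
    """Forward pass through layers 2 to 6.

    premise[j][i] = (c, a, b) for input i of rule j;
    consequent[j] = [a_j0, a_j1, ..., a_jn], so f_j(x) = a_j0 + sum_i a_ji x_i.
    """
    # Layers 2 and 3: membership grades and product t-norm, (22.7) and (22.8).
    firing = [
        math.prod(gbell(xi, *p) for xi, p in zip(x, rule)) for rule in premise
    ]
    # Layer 4: normalized firing strengths, (22.9).
    total = sum(firing)
    norm = [w / total for w in firing]
    # Layer 5: rule outputs weighted by normalized firing strengths, (22.10).
    f = [coef[0] + sum(a * xi for a, xi in zip(coef[1:], x)) for coef in consequent]
    # Layer 6: summation of the layer-5 outputs.
    return sum(wn * fj for wn, fj in zip(norm, f))


# Two rules on one input with constant consequents 1 and 3 (hypothetical values).
premise = [[(0.0, 1.0, 2.0)], [(2.0, 1.0, 2.0)]]
consequent = [[1.0, 0.0], [3.0, 0.0]]
out_mid = anfis_forward([1.0], premise, consequent)
print(out_mid)
```

Every operation in this pass is differentiable with respect to the premise and
consequent parameters, which is what allows gradient-based training.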

In the ANFIS model, functions used at all the nodes are differentiable, thus 
BP can be used to train the network. Each membership function $\mu_{A_j^i}$ is
specified by a predefined shape and its corresponding shape parameters. The shape
parameters are adjusted by a learning algorithm using a sample set of size $N$,
$\{(\mathbf{x}_p, y_p)\}$. For nonlinear modeling, the effectiveness of the model is dependent
on the membership functions used. 

The TSK fuzzy rules are employed in the ANFIS model

$$R_i: \text{IF } \mathbf{x} \text{ is } A_i, \text{ THEN } y = f_i(\mathbf{x}) = \sum_{j=1}^{n} a_{i,j} x_j + a_{i,0}, \quad i = 1, \ldots, K,$$

where $A_i = \{A_i^1, A_i^2, \ldots, A_i^n\}$ are fuzzy sets and $a_{i,j}$, $j = 0, 1, \ldots, n$, are
consequent parameters. The output of the network for pattern $p$ is thus given by

$$\hat{y}_p = \frac{\sum_{i=1}^{K} \mu_{A_i}(\mathbf{x}_p) f_i(\mathbf{x}_p)}{\sum_{i=1}^{K} \mu_{A_i}(\mathbf{x}_p)}, \qquad (22.11)$$

where $\mu_{A_i}(\mathbf{x}_p) = \bigwedge_{j=1}^{n} \mu_{A_i^j}(x_{p,j}) = \prod_{j=1}^{n} \mu_{A_i^j}(x_{p,j})$. Accordingly, the error
measure for pattern $p$ is defined by

$$E_p = \frac{1}{2}\left(\hat{y}_p - y_p\right)^2. \qquad (22.12)$$

After the rule base is specified, ANFIS adjusts only the membership functions 
of the antecedents and the consequent parameters. The BP algorithm can be 
used to train both the premise and consequent parameters. A more efficient 
procedure is to learn the premise parameters by BP, but to learn the linear 
consequent parameters $a_{i,j}$ by the RLS method [36]. The learning rate $\eta$ can
be adaptively adjusted by using a heuristic used for MLP learning. This hybrid 
learning method provides better results than the MLP trained with BP and 
the cascade-correlation network [36]. Second-order methods are also applied for 
training ANFIS. Compared to the hybrid method, the LM method for ANFIS 
training [37] achieves a better precision, but the interpretability of the final 
membership functions is quite weak. In [12], RProp and RLS are used to learn 
the premise parameters and the consequent parameters, respectively. 
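The hybrid scheme relies on the fact that, once the premise parameters are fixed, the network output is linear in the consequent parameters, so they can be estimated recursively. The following is a minimal sketch of a standard RLS update applied to that linear subproblem; it is not taken from [36], and the function names, the initial value `delta`, and the forgetting factor `lam` are assumptions made for illustration.

```python
def rls_init(dim, delta=1e6):
    """Start from theta = 0 and a large diagonal P (weak prior on theta)."""
    theta = [0.0] * dim
    P = [[delta if r == c else 0.0 for c in range(dim)] for r in range(dim)]
    return theta, P

def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive least-squares step for the linear model y ~ phi . theta."""
    dim = len(theta)
    P_phi = [sum(P[r][c] * phi[c] for c in range(dim)) for r in range(dim)]
    denom = lam + sum(phi[r] * P_phi[r] for r in range(dim))
    gain = [v / denom for v in P_phi]
    err = y - sum(phi[r] * theta[r] for r in range(dim))
    theta = [theta[r] + gain[r] * err for r in range(dim)]
    # P is symmetric, so phi^T P equals (P phi)^T.
    P = [[(P[r][c] - gain[r] * P_phi[c]) / lam for c in range(dim)]
         for r in range(dim)]
    return theta, P

def regressor(x, strengths):
    """Regressor built from normalized firing strengths:
    [w_1, w_1*x_1, ..., w_1*x_n, w_2, w_2*x_1, ...], one group per rule."""
    phi = []
    for w in strengths:
        phi.append(w)
        phi.extend(w * xk for xk in x)
    return phi
```

In use, each training pattern would first be passed through the premise part to obtain its normalized firing strengths; `regressor` then yields the vector whose inner product with the stacked consequent parameters is the network output, and `rls_update` refines those parameters one pattern at a time.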

ANFIS is attractive for applications in view of its network structure and the 
standard learning algorithm. However, it is computationally expensive due to 
the curse-of-dimensionality problem arising from grid partitioning. Constraints 
on membership functions and initialization using prior knowledge cannot be 
provided to the ANFIS model due to the learning procedure. The learning results 
may be difficult to interpret. Thus, ANFIS is suitable for applications where
performance is more important than interpretability. In order to preserve the
plausibility of ANFIS, one can add some regularization terms to the cost function 
so that some constraints on interpretability are considered [38]. 

Coactive ANFIS [60] is a generalization of ANFIS obtained by introducing 
nonlinearity into the TSK rules. Generalized ANFIS [3] is based on a generaliza- 
tion of the TSK model and a generalized Gaussian RBF network. The generalized 
fuzzy model is trained by using the generalized RBF network model, based on the 
functional equivalence between the two models. In sigmoid-ANFIS [93], only sig- 
moidal membership functions are employed. Sigmoid-ANFIS is a combination of 
the additive TSK-type MLP and the additive TSK-type fuzzy inference system. 
Additive TSK-type MLP, as an extended model of the MLP, is proved in [93] 
to be functionally equivalent to the TSK-type fuzzy inference system and to be 
a universal approximator. The sigmoid-ANFIS model adopts the interactive-or 
operator as its fuzzy connectives. In addition, the gradient-descent algorithm can 
also be directly applied to the TSK model without representing it in a network 
structure [64]. 

Unfolding-in-time is a method to transform a recurrent network into a feed- 
forward network so that the BP algorithm can be used. ANFIS-unfolded-in-time 
[75] is a method that duplicates ANFIS T times to integrate temporal infor- 
mation, where T is the number of time intervals needed in the specific problem. 
ANFIS-unfolded-in-time is designed for prediction of time series data. Simulation 
results show that the recognition error is much smaller in ANFIS-unfolded-in-time
than in ANFIS.

Self-organization has been introduced in hybrid systems to create adaptive 
models for adaptively representing time-varying systems and model identifica- 
tion. Adaptation can be through one of the two strategies: a fixed space par- 
titioning with adaptive fuzzy rule parameters or a simultaneous adaptation of 
space partitioning and fuzzy rule parameters. While the ANFIS model belongs to 
the former category, adaptive parsimonious neurofuzzy systems can be achieved 
by using a constructive approach and the latter strategy [89]. The dynamic fuzzy 
neural network [89] is an online, constructive implementation of the TSK fuzzy 
system based on an extended RBF network and its learning algorithm. The 
extended RBF network has five layers and no bias, and the weights may be a 
linear regression of the input. 


Example 22.1: 

We use the ANFIS model to solve the IRIS classification problem. For the
120 patterns, the ranges of the input and output variables are $x_1 \in [4.3, 7.9]$,
$x_2 \in [2.0, 4.4]$, $x_3 \in [1.0, 6.9]$, $x_4 \in [0.1, 2.5]$, $y \in [1, 3]$.

An initial TSK fuzzy inference system is first generated by using grid
partitioning. Each of the variables is partitioned into 3 subsets. The Gaussian
membership function is selected. The maximum number of epochs is 100. The fuzzy
partitioning for the input space as well as the training error is illustrated in
Fig. 22.3. The ANFIS model generates 193 nodes, 405 linear parameters, 24
nonlinear parameters, and 81 fuzzy rules. The training time is 53.70 s. The
classification error for the training set is 0.
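The rule and parameter counts reported above follow directly from the grid partition, and a short calculation shows how quickly they grow with the number of inputs. The sketch below assumes first-order TSK consequents (n + 1 linear parameters per rule) and two parameters per Gaussian membership function (center and width); the function name is hypothetical.

```python
def grid_partition_counts(n_inputs, mfs_per_input, params_per_mf=2):
    """Rule and parameter counts for a first-order TSK system on a full grid."""
    rules = mfs_per_input ** n_inputs            # one rule per grid cell
    linear = rules * (n_inputs + 1)              # a_{i,0}, ..., a_{i,n} per rule
    nonlinear = n_inputs * mfs_per_input * params_per_mf  # premise parameters
    return rules, linear, nonlinear

print(grid_partition_counts(4, 3))   # → (81, 405, 24), as in Example 22.1

# The number of rules grows exponentially with the input dimension.
for n in (2, 4, 6, 8):
    print(n, grid_partition_counts(n, 3)[0])
```

With 3 membership functions per input, 8 inputs already require 6561 rules, which illustrates why grid partitioning becomes expensive in higher dimensions.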


Example 22.2: We solve the IRIS problem using the ANFIS with scatter
partitioning. Clustering the input space is a desired method for generating fuzzy
rules. This can significantly reduce the total number of fuzzy rules, and hence
offers a better generalization capability. Subtractive clustering is used for rule
extraction so as to find an initial fuzzy inference system for ANFIS training.
Radius $r \in [0, 1]$ specifies the range of influence of the cluster center for each
input or output dimension. The training error can be controlled by adjusting $r$.
Specifying a smaller cluster radius usually yields more, smaller clusters in the
data, and hence more rules.
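The effect of the radius on the number of clusters can be seen in a simplified sketch of subtractive clustering. This is not the exact procedure used in the example: it assumes the data have been scaled to the unit hypercube, uses a single stopping threshold in place of the usual accept/reject tests, and the parameter names and default values (`squash`, `stop_ratio`) are assumptions.

```python
import math

def subtractive_clustering(points, radius, squash=1.5, stop_ratio=0.15):
    """Simplified subtractive clustering on data scaled to [0, 1].

    radius     : range of influence r of a cluster center
    squash     : the penalty radius is squash * radius
    stop_ratio : stop when the best remaining potential falls below
                 stop_ratio times the first (largest) potential
    """
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Potential of each point: large where many points lie close by.
    potential = [sum(math.exp(-alpha * sqdist(p, q)) for q in points)
                 for p in points]
    first = max(potential)
    centers = []
    while True:
        best = max(range(len(points)), key=potential.__getitem__)
        if potential[best] < stop_ratio * first:
            break
        centers.append(points[best])
        peak = potential[best]
        # Subtract the influence of the new center from every potential.
        potential = [pot - peak * math.exp(-beta * sqdist(p, points[best]))
                     for p, pot in zip(points, potential)]
    return centers
```

Each cluster center found this way would seed one fuzzy rule, so a smaller radius, which leaves more points with a high remaining potential, produces more rules.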

Since the range of the input space is very small compared to that of the output
space, we select $r = 0.8$ for all the input dimensions and the output space. The
training time is 1.4203 s for 200 epochs. After training, the MSE testing error
is 0.0126. The ANFIS model has 37 nodes, 15 linear parameters, 24 nonlinear
parameters, and 3 fuzzy rules. The classification error is 1.33%. The scatter
partitioning is shown in Fig. 22.4a, b, and the training and testing errors are
illustrated in Fig. 22.4c.

Figure 22.3 IRIS classification: grid partitioning of the input space. (a) The initialized membership
functions. (b) The learned membership functions. (c) The training RMS error.

In order to further increase the training accuracy, we can select r = 0.3 for 
all the input dimensions and the output space to get finer clustering. Then we 
can get more rules. The ANFIS model has 107 nodes, 50 linear parameters, 80 
nonlinear parameters, and 10 fuzzy rules. The training time is 3.3866 s for 200 
epochs. After training, the MSE testing error is 1.5634 x 10~°. The classification 
error is 0. The result is shown in Fig. 22.5. 

For the 10 rules generated, each rule has its own membership function for each
input variable. For example, the $i$th rule is given by

$$R_i: \text{IF } x_1 \text{ is } \mu_{i,1} \text{ AND } x_2 \text{ is } \mu_{i,2} \text{ AND } x_3 \text{ is } \mu_{i,3} \text{ AND } x_4 \text{ is } \mu_{i,4} \text{ THEN } y \text{ is } \mu_{i,y},$$

where $\mu_{i,k}$, $k = 1, \ldots, 4$, and $\mu_{i,y}$ are membership functions. Each row of plots
in Fig. 22.5d corresponds to one rule, and each column corresponds to either an
input variable $x_i$ or the output variable $y$.

Figure 22.4 IRIS classification: scatter partitioning of the input space. (a) The initialized membership
functions. (b) The learned membership functions. (c) The training RMS error. (d) The 3 fuzzy rules
generated. Note that some membership functions coincide. r = [0.8, 0.8, 0.8, 0.8, 0.8].
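The parameter counts quoted for the two scatter-partitioned models are consistent with each rule carrying its own membership function per input. A small check, assuming first-order TSK consequents and two parameters per membership function (the membership function type is not stated for this example, so this is an assumption), with a hypothetical function name:

```python
def scatter_partition_counts(n_inputs, n_rules, params_per_mf=2):
    """Parameter counts when every rule has its own membership functions."""
    linear = n_rules * (n_inputs + 1)                 # consequent parameters
    nonlinear = n_rules * n_inputs * params_per_mf    # premise parameters
    return linear, nonlinear

print(scatter_partition_counts(4, 3))    # → (15, 24), the r = 0.8 model
print(scatter_partition_counts(4, 10))   # → (50, 80), the r = 0.3 model
```

Both counts grow linearly with the number of rules, in contrast to the exponential growth under grid partitioning.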


Example 22.3: Data is generated from the Mackey-Glass time-delay differential
equation defined by

$$\frac{dx(t)}{dt} = \frac{0.2\, x(t - \tau)}{1 + x^{10}(t - \tau)} - 0.1\, x(t).$$

When $x(0) = 1.2$ and $\tau = 17$, we have a non-periodic and non-convergent time
series that is very sensitive to initial conditions. We assume $x(t) = 0$ when $t < 0$.
Figure 22.5 IRIS classification: scatter partitioning of the input space. (a) The initialized membership
functions. (b) The learned membership functions. (c) The training RMS error. (d) The 10 fuzzy rules 
generated. Note that some membership functions coincide. r = [0.3, 0.3, 0.3, 0.3, 0.3]. 


We build an ANFIS that can predict $x(t + 6)$ from the past values of this time
series, that is, $x(t - 18)$, $x(t - 12)$, $x(t - 6)$, and $x(t)$. Therefore the training data
format is $[x(t - 18), x(t - 12), x(t - 6), x(t); x(t + 6)]$. From $t = 118$ to 1117, we
collect 1000 data pairs of the above format. The first 500 are used for training
while the others are used for checking.
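The data set described above can be produced with a short script. The sketch below integrates the Mackey-Glass equation with a plain forward Euler scheme and a step of 0.1, which is an assumption made for brevity (a higher-order integrator would normally be preferred), using a delay of 17, x(0) = 1.2, and x(t) = 0 for t < 0 as in the example. The function names are hypothetical.

```python
def mackey_glass(n_steps, tau=17, x0=1.2, dt=0.1):
    """Integrate dx/dt = 0.2 x(t - tau) / (1 + x(t - tau)^10) - 0.1 x(t)
    with forward Euler; returns samples at t = 0, 1, ..., n_steps."""
    per_unit = round(1.0 / dt)      # Euler steps per unit of time
    delay = tau * per_unit          # delay expressed in Euler steps
    history = [x0]
    for k in range(n_steps * per_unit):
        x = history[-1]
        x_tau = history[k - delay] if k >= delay else 0.0   # x(t) = 0 for t < 0
        history.append(x + dt * (0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x))
    return history[::per_unit]

def make_pairs(series, t_start=118, t_end=1117):
    """Inputs [x(t-18), x(t-12), x(t-6), x(t)] and targets x(t+6)."""
    times = range(t_start, t_end + 1)
    inputs = [[series[t - 18], series[t - 12], series[t - 6], series[t]]
              for t in times]
    targets = [series[t + 6] for t in times]
    return inputs, targets

series = mackey_glass(1123)         # x(t) is needed up to t = 1117 + 6
inputs, targets = make_pairs(series)
train_x, train_y = inputs[:500], targets[:500]      # training set
check_x, check_y = inputs[500:], targets[500:]      # checking set
```

Because the series is chaotic, the values obtained depend on the integration scheme and step size, so numerical results will not match the figures exactly.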

We first generate an initial fuzzy inference system employing 2 generalized bell
membership functions and a grid partition, using the training data, and then
apply ANFIS. The number of training epochs is 10. The first 100 data points are
ignored to avoid the transient portion of the data. The result is shown in
Fig. 22.6.
Figure 22.6 ANFIS regression. (a) Data from Mackey-Glass chaotic time series. (b) The learned
membership functions. (c) The training and testing RMS errors. (d) The prediction error. 


22.6 Fuzzy SVMs


In the support-vector based fuzzy neural network for classification [51], the initial 
number of rules is equal to the number of Gaussian-kernel support vectors. A 
learning algorithm is then used to remove irrelevant fuzzy rules. The consequent 
part of each rule is of the fuzzy singleton type. The learning algorithm consists
of three learning phases. First, the fuzzy rules and membership functions are
determined by clustering. Then, the parameters of the fuzzy neural network are
calculated by the SVM with an adaptive fuzzy kernel function. Finally, the relevant
fuzzy rules are selected. 

The self-organizing TSK-type fuzzy network with support vector learning [43] 
is a fuzzy system constructed by the hybridization of fuzzy clustering and SVM. 
The antecedent part is generated via fuzzy clustering of the input data, and then 
SVM is used to tune the consequent part parameters. 
SVR is incapable of interpreting local behavior of the estimated models. Local
SVR can model local behavior of models, but it still has the problem of bound- 
ary effects, which may generate a large bias on the boundary and also need 
more time to calculate. Fuzzy weighted SVR with fuzzy partition [18] eliminates 
the boundary effects. It first employs FCM clustering to split the training data 
into several training subsets. Then, local-regression models are independently 
obtained by SVR for each training subset. Those local-regression models are 
then combined by a fuzzy weighted mechanism to form the output. The pro- 
posed approach needs less computational time than the local SVR does and can 
have more accurate results than the local/global SVR does. 

In the fuzzy SVR method of [39], the weighting vector gained in SVM is 
regarded as an initial weighting vector in a fuzzy neural network. RLS is then 
used to tune the weights of the fuzzy neural network. In a fuzzy modeling network 
based on SVM for regression and classification [17], the fuzzy basis function is 
regarded as the kernel function in an SVM, and fuzzy rules are generated based 
on the extracted support vectors. In [17], the rule base of a fuzzy system is 
extracted from learned SVM. A zeroth-order TSK fuzzy system is obtained; the 
number of fuzzy rules is equal to the number of support vectors. 

TSK-based SVR [44] is motivated by TSK-type fuzzy rules and its parameters 
are learned by a combination of fuzzy clustering and linear SVR. In contrast to 
a kernel-based SVR, TSK-based SVR has a smaller number of parameters while 
retaining SVR’s generalization ability. In TSK-based SVR, a one-pass clustering 
algorithm clusters the input training data; a TSK-kernel, which corresponds to 
a TSK-type fuzzy rule, is then constructed by the product of a cluster output 
and a linear combination of input variables. The output is a linear weighted sum 
of the TSK-kernels. 

The support vector interval regression networks [40] utilize the $\varepsilon$-SVR for
interval regression analysis. A two-step approach is proposed by constructing two
independent RBF networks, identifying the lower and upper bounds of the data 
interval, respectively, after determining the initial structures and parameters of 
support vector interval regression networks through SVR mechanism. BP learn- 
ing is employed to adjust the RBF networks. Support vector interval regression 
machine [31] evaluates the interval regression model combining the possibility 
estimation formulation and the property of central tendency with the principle 
of $\varepsilon$-SVR. It performs better for the data set with outliers and is conceptually
simpler than support vector interval regression networks [40]. 

SVM is very sensitive to outliers and noise, since the penalty term of SVM treats
every data point equally in the training process. This may result in overfitting.
Fuzzy SVM [50] deals with the overfitting problem. In fuzzy SVMs, training
examples are assigned different fuzzy membership values based on their
importance, and these membership values are incorporated into the SVM learning
algorithm to make it less sensitive to outliers and noise. In [6], fuzzy SVM is
improved for class imbalance learning in the presence of outliers and noise. A
kernel FCM clustering-based fuzzy SVM algorithm [92] deals with classification
problems with outliers or noise.
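To make the role of the membership values concrete, the sketch below assigns each training example a membership that decreases with its distance from its own class mean, and evaluates a soft-margin objective in which each slack term is scaled by that membership. The text does not specify how memberships are chosen, so the distance-to-class-center rule, the small constant `delta`, and both function names are assumptions made for illustration.

```python
import math

def class_center_memberships(points, labels, delta=1e-3):
    """Membership s_i = 1 - d_i / (r + delta), where d_i is the distance of
    point i to its own class mean and r is the largest such distance in
    that class. Points far from the class mean get small memberships."""
    memberships = [0.0] * len(points)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        dim = len(points[idx[0]])
        mean = [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]
        dist = {i: math.dist(points[i], mean) for i in idx}
        radius = max(dist.values())
        for i in idx:
            memberships[i] = 1.0 - dist[i] / (radius + delta)
    return memberships

def fuzzy_svm_objective(w, b, points, labels, memberships, C=1.0):
    """0.5 ||w||^2 + C * sum_i s_i * max(0, 1 - y_i (w . x_i + b))."""
    loss = 0.0
    for x, y, s in zip(points, labels, memberships):
        score = sum(wk * xk for wk, xk in zip(w, x)) + b
        loss += s * max(0.0, 1.0 - y * score)
    return 0.5 * sum(wk * wk for wk in w) + C * loss
```

An example with a small membership contributes little to the penalty even when it violates the margin, which is how the weighting reduces the influence of outliers on the decision boundary.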

As there exist problems of finite samples and uncertain data in the estima- 
tion, in [90], the input and output variables are described as fuzzy numbers, 
and fuzzy $\nu$-SVM is proposed by combining the fuzzy theory with $\nu$-SVM. A
fuzzy version of LS-SVM in [81] removes the unclassifiable regions for multiclass 
problems. Fuzzy one-class SVM [27] incorporates the concept of fuzzy set theory 
into the one-class SVM model. It treats the training data points with different 
importance in the training process. For binary classification, hard-margin SVMs 
are reformulated into fuzzy rough set-based SVMs with new constraints in which 
the membership is taken into account [16]. 

Total margin-based adaptive fuzzy SVM (TAF-SVM) [55] solves the
class-boundary-skew problem due to very imbalanced data sets and the overfitting
problem resulting from outliers. By introducing the total margin algorithm to
replace the conventional soft-margin algorithm, it achieves a lower generalization
error bound.


Other neurofuzzy models 


Neurofuzzy systems can employ network topologies similar to those of layered 
feedforward networks [28, 19], the RBF network [89], the SOM model [84], and 
recurrent networks [52], mainly used for function approximation. These models 
typically have a layered feedforward network architecture and are based on TSK- 
type fuzzy inference systems. Neurofuzzy systems are usually trained by using 
the gradient-descent method. Gradient descent in this case is sometimes termed
fuzzy BP, and CG for training a neurofuzzy system is also known as the fuzzy
CG algorithm [53].

Fuzzy clustering is primarily based on competitive learning networks such as 
the Kohonen network and the ART models. Based on the fuzzification of the 
linear autoassociative neural networks, fuzzy PCA [19] can extract a number of 
relevant features from high-dimensional fuzzy data. A fuzzy wavelet network has 
also been introduced for approximating arbitrary nonlinear functions in [29]. 

The fuzzy perceptron network [14] is a type-I neurofuzzy system. The input to
the network can be either fuzzy IF-THEN rules or numerical data. The learning
scheme is derived based on the $\alpha$-cut concept, which extends perceptron
learning to fuzzy input vectors. Moreover, the fuzzy pocket algorithm is derived in
[14] and then further incorporated into the fuzzy perceptron learning scheme
to tackle inseparable cases. Fuzzy BP [77] shows considerably greater
convergence speed than BP does and can easily escape from local minima. For the
aggregation of input values (forward propagation), the Sugeno fuzzy integral is
employed. For weight learning, error backpropagation takes place. QuickFBP is
a modification of fuzzy BP, where the modified computation of the net function
is significantly faster [63]. Fuzzy BP is proved to be of exponential complexity in
the case of large-sized networks with a large number of inputs, while QuickFBP
is of polynomial complexity. The concept of fuzzy kernel perceptron is presented
in [15].

Hybrid neural fuzzy inference system (HyFIS) [46] is a five-layer neurofuzzy 
model based on the Mamdani fuzzy inference system. Layer 1 is the input layer 
with crisp values. Expert knowledge can be used for the initialization of these 
membership functions. HyFIS employs a hybrid learning scheme comprised of 
two phases, namely, rule generation from data (structure learning) and rule tun- 
ing using BP learning (parameter learning). HyFIS first extracts fuzzy rules from 
data by using the look-up-table technique [85]. This is used as the initial struc- 
ture so that the learning process can be fast, reliable and highly intuitive. The 
gradient-descent method is then applied to tune the membership functions of 
input/output linguistic variables and the network weights. Only a few training 
iterations are needed for the model to converge, since the initial structure and 
weights of the model are set properly. HyFIS is comparable in performance to
ANFIS.

Fuzzy min-max neural networks are a class of neurofuzzy models using min- 
max hyperboxes for clustering, classification and regression [73, 74, 24]. The 
max-min fuzzy Hopfield network [52] is a fuzzy recurrent network for fuzzy asso- 
ciative memory. The manipulations of the hyperboxes involve mainly compar- 
ison, addition and subtraction operations, thus learning is extremely efficient. 
An implicative fuzzy associative memory [79] consists of a network of completely 
interconnected Pedrycz logic neurons with threshold whose connection weights 
are determined by the minimum of implications of presynaptic and postsynaptic 
activations. 

Fuzzy k-NN [45] replaces the k-NN rule by associating each sample with a 
membership value expressing how closely the pattern belongs to a given class. 
In fuzzy k-NN, the importance of a neighbor is determined based on the relative 
distance between the neighbor and the test pattern. The proposed algorithm 
possesses the possibilistic classification capability. The algorithm considers all 
training patterns as the neighbors with different degrees. It avoids the problem 
of choosing the optimal value of K. 
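The distance-based weighting can be illustrated with a short sketch of a fuzzy k-NN rule in its commonly cited inverse-distance form. The fuzzifier `m`, the small constant `eps` that guards against division by zero, and the function name are assumptions; setting `k` to the number of training patterns corresponds to treating all patterns as neighbors with different degrees.

```python
import math

def fuzzy_knn(query, points, memberships, k, m=2.0, eps=1e-12):
    """Class memberships of `query` from its k nearest training patterns.

    memberships[j][c] is the membership of training pattern j in class c.
    Each neighbor is weighted by 1 / distance^(2 / (m - 1)), so closer
    neighbors count more; m > 1 controls how quickly the weight decays.
    """
    nearest = sorted(range(len(points)),
                     key=lambda j: math.dist(query, points[j]))[:k]
    weights = [1.0 / (math.dist(query, points[j]) ** (2.0 / (m - 1.0)) + eps)
               for j in nearest]
    total = sum(weights)
    n_classes = len(memberships[0])
    return [sum(w * memberships[j][c] for w, j in zip(weights, nearest)) / total
            for c in range(n_classes)]

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # crisp memberships
u = fuzzy_knn((0.5, 0.0), points, labels, k=len(points))
```

The result is a vector of class memberships rather than a single label, so the confidence of the assignment remains available to later processing.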

To reduce the effect of outliers, fuzzy memberships are introduced to robustly 
estimate the scatter matrices, leading to fuzzy LDA [13]. Fuzzy LDA still cannot 
accommodate nonlinearly separable cases. Fuzzy Fisherface is proposed for face 
recognition via fuzzy set [49]. Fuzzy 2DLDA [91] extends fuzzy Fisherface to 
image matrix. 

A fuzzy inference system is adopted to determine the learning rate for BSS 
[56], yielding good signal separation. Fuzzy FastICA [30] can handle fuzziness in 
the iterative algorithm by using clustering as preprocessing. 

Generic fuzzy perceptron [61] has a structure similar to that of the three- 
layer MLP. The network inputs and the weights are modeled as fuzzy sets, and 
t-norm or t-conorm is used as the activation at each unit. The hidden layer 
acts as the rule layer. The output units usually use a defuzzification function.

Figure 22.7 NEFPROX as an example of the generic fuzzy perceptron model.


Generic fuzzy perceptron can interpret its structure in the form of linguistic 
rules and the structure of generic fuzzy perceptron can be treated as a linguistic 
rule base, where the weights between the input and hidden (rule) layers are 
called fuzzy antecedent weights and the weights between the hidden (rule) and 
output layers, fuzzy consequent weights. The generic fuzzy perceptron model is 
based on the Mamdani model. Due to the use of nondifferentiable t-norm and t- 
conorm, the gradient-descent method cannot be applied. A set of linguistic rules 
are used for describing the performance of the models. This knowledge-based 
fuzzy error is independent of the range of the output value. Based on the generic 
fuzzy perceptron model, there are three fuzzy models [61], namely neurofuzzy 
controller (NEFCON), neurofuzzy classification (NEFCLASS) and neurofuzzy
function approximation (NEFPROX). Learning algorithms for all these models 
are derived from the fuzzy error using simple heuristics. 

Initial fuzzy partitions need to be specified for each input variable. Some
connections with identical linguistic values are forced to have the same weights 
so as to keep the interpretability. Prior knowledge can be integrated in the form 
of fuzzy rules to initialize the neurofuzzy systems, and the remaining rules are 
obtained by learning. NEFCON has a single output node, and is used for control. 
A reinforcement learning algorithm is used for online learning. NEFCLASS and 
NEFPROX can learn rules by using supervised learning. NEFCLASS does not 
use membership functions in the rules' consequents. NEFPROX is more general.
The architecture of NEFPROX is shown in Fig. 22.7. When there is only a
single output, NEFPROX has the same architecture as NEFCON, and when 
no membership functions are used in the consequent parts, NEFPROX has the 
same architecture as NEFCLASS. The hidden layer is the rule layer with each 
node corresponding to a rule, and the output layer is the defuzzification layer. 
All ud) and pe? are, respectively, the membership functions used in the premise 
and consequent parts. NEFPROX is an order of magnitude faster than ANFIS, 
but with a higher approximation error [61]. NEFPROX represents a Mamdani 
system with too many rules. 
In continuous fuzzy reinforcement learning, a fuzzy inference system is used to
obtain an approximate model for the value function in continuous state space 
and to generate continuous actions [8]. A critic-only based fuzzy reinforcement 
learning uses only a fuzzy system for approximating action value function, and 
generates action with probability proportional to this function; hence, such an 
action selection strategy allows possible implementation of balance. Fuzzy Q- 
learning [42] is a Q-learning method with linear function approximators using 
fuzzy systems, is a critic-only fuzzy reinforcement learning algorithm, and is 
applied to some problems with continuous state and action spaces. Fuzzy Sarsa 
learning [20] is based on linear Sarsa, and the existence of stationary points is 
established for it. It is an extension of Sarsa for continuous state and action 
spaces using fuzzy system as function approximator. It achieves higher learning 
speed and action quality compared to that of fuzzy Q-learning. A fuzzy balance 
management scheme between exploration and exploitation can be implemented 
in any critic-only fuzzy reinforcement learning method [21]. Establishing balance 
greatly depends on the accuracy of action value function approximation. An 
enhanced fuzzy Sarsa learning [21] integrates an adaptive learning rate and a 
fuzzy balancer into fuzzy Sarsa learning. 

Interval-valued data provides a way of representing the available information 
in complex problems with uncertainty, inaccuracy or variability. Interval analysis 
can be combined with neural networks to solve decision support systems. When 
a neural network has at least one of its input, output or weight sets being interval 
valued, it is an interval neural network. The interval neural network proposed 
for fuzzy regression analysis in [33] has all its weights, biases and output being 
interval valued, but input data being crisp. This kind of interval neural network 
is proved to have the capability of universal approximation [4]. Interval MLP 
[72] approximates nonlinear interval functions with interval-valued inputs and 
outputs, where weights and biases are single-valued, but its transfer function 
can operate with interval-valued inputs and outputs. Interval-SVM [2] directly 
incorporates an interval approach into the SVMs by inserting information into 
the SVM in the form of intervals. 
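For the case of interval-valued weights and biases with crisp inputs, the output interval of a single neuron follows from elementary interval arithmetic. The sketch below assumes a sigmoid activation, which is increasing and therefore maps interval endpoints to endpoints; the function name and argument layout are hypothetical.

```python
import math

def interval_neuron(x, w_lo, w_hi, b_lo, b_hi):
    """Output interval of a sigmoid neuron with interval weights
    [w_lo[i], w_hi[i]], interval bias [b_lo, b_hi], and crisp input x."""
    net_lo, net_hi = b_lo, b_hi
    for xi, lo, hi in zip(x, w_lo, w_hi):
        # [lo, hi] * xi: the endpoints swap when xi is negative.
        a, b = lo * xi, hi * xi
        net_lo += min(a, b)
        net_hi += max(a, b)

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    # The sigmoid is increasing, so endpoints map to endpoints.
    return sigmoid(net_lo), sigmoid(net_hi)

lo, hi = interval_neuron([1.0, -2.0], [0.0, 0.5], [1.0, 1.0], 0.0, 0.0)
# The net input interval is [-2, 0], so hi == 0.5.
```

When every weight interval collapses to a single value, the two outputs coincide and the neuron reduces to an ordinary sigmoid unit.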


22.1 The fuzzy AND and OR neurons are intrinsically excitatory, since higher 
input implies higher output. To generate inhibitory behaviors, one can negate 
the input x, that is, 1 — x. A nonlinear activation function can also be applied 
to the AND or OR neuron output. These neurons can constitute a layered fuzzy 
neural network. Consider how a bias can be incorporated into the definition of 
the AND and OR neuron. 


22.2 Train an ANFIS model to identify the nonlinear dynamic system given by 


pa eed) 


Tfh +a- 1") 
The input $u(n)$ is uniformly selected in $[-1, 1]$ and the test input is $u(n) =
0.5 \sin(2\pi n/29)$. Produce 20,000 and 200 observation data for training and
testing, respectively.


References 


[1] S. Abe & M.S. Lan, Fuzzy rules extraction directly from numerical data for
function approximation. IEEE Trans. Syst. Man Cybern., 25:1 (1995), 119-129.

[2] C. Angulo, D. Anguita, L. Gonzalez-Abril & J.A. Ortega, Support vector
machines for interval discriminant analysis. Neurocomput., 71 (2008), 1220-1229.

[3] M.F. Azeem, M. Hanmandlu & N. Ahmad, Generalization of adaptive neuro-fuzzy
inference systems. IEEE Trans. Neural Netw., 11:6 (2000), 1332-1346.

[4] M.R. Baker & R.B. Patil, Universal approximation theorem for interval neural
networks. Reliab. Comput., 4 (1998), 235-239.

[5] N.H. Barakat & A.P. Bradley, Rule extraction from support vector machines: A
sequential covering approach. IEEE Trans. Knowl. Data Eng., 19:6 (2007), 729-741.

[6] R. Batuwita & V. Palade, FSVM-CIL: Fuzzy support vector machines for class
imbalance learning. IEEE Trans. Fuzzy Syst., 18:3 (2010), 558-571.

[7] J.M. Benitez, J.L. Castro & I. Requena, Are artificial neural networks black
boxes? IEEE Trans. Neural Netw., 8:5 (1997), 1156-1164.

[8] H.R. Berenji & D. Vengerov, A convergent actor-critic-based FRL algorithm
with application to power management of wireless transmitters. IEEE Trans.
Fuzzy Syst., 11:4 (2003), 478-485.

[9] J.L. Castro, C.J. Mantas & J.M. Benitez, Interpretation of artificial neural
networks by means of fuzzy rules. IEEE Trans. Neural Netw., 13:1 (2002),
101-116.

[10] J.L. Castro, L.D. Flores-Hidalgo, C.J. Mantas & J.M. Puche, Extraction of
fuzzy rules from support vector machines. Fuzzy Sets Syst., 158 (2007),
2057-2077.

[11] A. Cechin, U. Epperlein, B. Koppenhoefer & W. Rosenstiel, The extraction of
Sugeno fuzzy rules from neural networks. In: M. Verleysen, Ed., Proc. Eur.
Symp. Artif. Neural Netw., Bruges, Belgium, 1996, 49-54.

[12] M.S. Chen & R.J. Liou, An efficient learning method of fuzzy inference
system. In: Proc. IEEE Int. Fuzzy Syst. Conf., Seoul, Korea, 1999, 634-638.

[13] Z.-P. Chen, J.-H. Jiang, Y. Li, Y.-Z. Liang & R.-Q. Yu, Fuzzy linear
discriminant analysis for chemical datasets. Chemometrics Intell. Lab. Syst.,
45 (1999), 295-302.

[14] J.L. Chen & J.Y. Chang, Fuzzy perceptron neural networks for classifiers
with numerical data and linguistic rules as inputs. IEEE Trans. Fuzzy Syst.,
8:6 (2000), 730-745.



















[15] J.-H. Chen & C.-S. Chen, Fuzzy kernel perceptron. IEEE Trans. Neural
Netw., 13:6 (2002), 1364-1373.

[16] D. Chen, Q. He & X. Wang, FRSVMs: Fuzzy rough set based support vector
machines. Fuzzy Sets Syst., 161 (2010), 596-607.

[17] J.H. Chiang & P.Y. Hao, Support vector learning mechanism for fuzzy
rule-based modeling: A new approach. IEEE Trans. Fuzzy Syst., 12:1 (2004), 1-12.

[18] C.-C. Chuang, Fuzzy weighted support vector regression with a fuzzy
partition. IEEE Trans. Syst. Man Cybern. B, 37:3 (2007), 630-640.

[19] T. Denoeux & M.H. Masson, Principal component analysis of fuzzy data
using autoassociative neural networks. IEEE Trans. Fuzzy Syst., 12:3 (2004),
336-349.

[20] V. Derhami, V.J. Majd & M.N. Ahmadabadi, Fuzzy Sarsa learning and
the proof of existence of its stationary points. Asian J. Contr., 10:5 (2008),
535-549.

[21] V. Derhami, V.J. Majd & M.N. Ahmadabadi, Exploration and exploitation
balance management in fuzzy reinforcement learning. Fuzzy Sets Syst., 161
(2010), 578-595.

[22] W. Duch, Uncertainty of data, fuzzy membership functions, and multilayer
perceptrons. IEEE Trans. Neural Netw., 16:1 (2005), 10-23.

[23] G. Fung, S. Sandilya & R. Rao, Rule extraction from linear support vector
machines. In: Proc. 11th ACM SIGKDD Int. Conf. Know. Discov. in Data
Mining (KDD), 2005, 32-40.

[24] B. Gabrys & A. Bargiela, General fuzzy min-max neural networks for
clustering and classification. IEEE Trans. Neural Netw., 11:3 (2000), 769-783.

[25] S.I. Gallant, Connectionist expert systems. Commun. of ACM, 31:2 (1988),
152-169.

[26] S. Guillaume, Designing fuzzy inference systems from data: An
interpretability-oriented review. IEEE Trans. Fuzzy Syst., 9:3 (2001), 426-443.

[27] P.-Y. Hao, Fuzzy one-class support vector machines. Fuzzy Sets Syst., 159
(2008), 2317-2336.

[28] Y. Hayashi, J.J. Buckley & E. Czogala, Fuzzy neural network with fuzzy
signals and weights. Int. J. Intell. Syst., 8:4 (1993), 527-537.

[29] D.W.C. Ho, P.A. Zhang & J. Xu, Fuzzy wavelet networks for function
learning. IEEE Trans. Fuzzy Syst., 9:1 (2001), 200-211.

[30] K. Honda, H. Ichihashi, M. Ohue & K. Kitaguchi, Extraction of local
independent components using fuzzy clustering. In: Proc. 6th Int. Conf. Soft
Computing (IIZUKA2000), 2000, 837-842.

[31] C. Hwang, D.H. Hong & K.H. Seok, Support vector interval regression
machine for crisp input and output data. Fuzzy Sets Syst., 157 (2006),
1114-1125.

[32] M. Ishikawa, Rule extraction by successive regularization. Neural Netw.,
13:10 (2000), 1171-1183.













Chapter 22. Neurofuzzy systems 


[33] H. Ishibuchi, H. Tanaka & H. Okada, An architecture of neural networks
with interval weights and its application to fuzzy regression analysis. Fuzzy
Sets Syst., 57 (1993), 27-39.

[34] H. Jacobsson, Rule extraction from recurrent neural networks: A taxonomy
and review. Neural Comput., 17:6 (2005), 1223-1263.

[35] J.S.R. Jang & C.I. Sun, Functional equivalence between radial basis function
networks and fuzzy inference systems. IEEE Trans. Neural Netw., 4:1 (1993),
156-159.

[36] J.S.R. Jang, ANFIS: Adaptive-network-based fuzzy inference systems. IEEE
Trans. Syst. Man Cybern., 23:3 (1993), 665-685.

[37] J.S.R. Jang & E. Mizutani, Levenberg-Marquardt method for ANFIS
learning. In: Proc. Biennial Conf. North Amer. Fuzzy Inf. Process. Soc. (NAFIPS),
Berkeley, CA, 1996, 87-91.

[38] J.S.R. Jang & C.I. Sun, Neuro-fuzzy modeling and control. Proc. IEEE,
83:3 (1995), 378-406.

[39] J.-T. Jeng & T.-T. Lee, Support vector machines for the fuzzy neural
networks. In: Proc. IEEE Int. Conf. Syst. Man Cybern., 1999, 115-120.

[40] J.-T. Jeng, C.-C. Chuang & S.-F. Su, Support vector interval regression
networks for interval regression analysis. Fuzzy Sets Syst., 138 (2003), 283-300.

[41] Y. Jin, Advanced Fuzzy Systems Design and Applications (Heidelberg:
Physica-Verlag, 2003).

[42] L. Jouffe, Fuzzy inference system learning by reinforcement methods. IEEE
Trans. Syst. Man Cybern. C, 28:3 (1998), 338-355.

[43] C.-F. Juang, S.-H. Chiu & S.-W. Chang, A self-organizing TS-type fuzzy
network with support vector learning and its application to classification
problems. IEEE Trans. Fuzzy Syst., 15:5 (2007), 998-1008.

[44] C.-F. Juang & C.-D. Hsieh, TS-fuzzy system-based support vector
regression. Fuzzy Sets Syst., 160 (2009), 2486-2504.

[45] J.M. Keller, M.R. Gray & J.A. Givens Jr., A fuzzy K-nearest neighbor
algorithm. IEEE Trans. Syst. Man Cybern., 15:4 (1985), 580-585.

[46] J. Kim & N. Kasabov, HyFIS: Adaptive neuro-fuzzy inference systems and
their application to nonlinear dynamical systems. Neural Netw., 12 (1999),
1301-1319.

[47] E. Kolman & M. Margaliot, Are artificial neural networks white boxes?
IEEE Trans. Neural Netw., 16:4 (2005), 844-852.

[48] E. Kolman & M. Margaliot, Extracting symbolic knowledge from recurrent
neural networks: A fuzzy logic approach. Fuzzy Sets Syst., 160 (2009), 145-161.

[49] K.C. Kwak & W. Pedrycz, Face recognition using a fuzzy fisher classifier.
Pattern Recogn., 38:10 (2005), 1717-1732.

[50] C.-F. Lin & S.-D. Wang, Fuzzy support vector machines. IEEE Trans.
Neural Netw., 13:2 (2002), 464-471.










[51] C.-T. Lin, C.-M. Yeh, S.-F. Liang, J.-F. Chung & N. Kumar,
Support-vector-based fuzzy neural network for pattern classification. IEEE
Trans. Fuzzy Syst., 14:1 (2006), 31-41.

[52] P. Liu, Max-min fuzzy Hopfield neural networks and an efficient learning
algorithm. Fuzzy Sets Syst., 112 (2000), 41-49.

[53] P. Liu & H. Li, Efficient learning algorithms for three-layer regular
feedforward fuzzy neural networks. IEEE Trans. Neural Netw., 15:3 (2004),
545-558.

[54] P. Liu & H. Li, Hierarchical TS fuzzy system and its universal
approximation. Inf. Sci., 169 (2005), 279-303.

[55] Y.-H. Liu & Y.-T. Chen, Face recognition using total margin-based adaptive
fuzzy support vector machines. IEEE Trans. Neural Netw., 18:1 (2007), 178-192.

[56] S.T. Lou & X.D. Zhang, Fuzzy-based learning rate determination for blind
source separation. IEEE Trans. Fuzzy Syst., 11:3 (2003), 375-383.

[57] E.D. Lughofer, FLEXFIS: A robust incremental learning approach for
evolving Takagi-Sugeno fuzzy models. IEEE Trans. Fuzzy Syst., 16:6 (2008),
1393-1410.

[58] D. Martens, B. Baesens & T. Van Gestel, Decompositional rule extraction
from support vector machines by active learning. IEEE Trans. Knowl. Data Eng.,
21:2 (2009), 178-191.

[59] S. Mitra & Y. Hayashi, Neuro-fuzzy rule generation: Survey in soft
computing framework. IEEE Trans. Neural Netw., 11:3 (2000), 748-768.

[60] E. Mizutani & J.S. Jang, Coactive neural fuzzy modeling. In: Proc. IEEE
Int. Conf. Neural Netw., Perth, Australia, 1995, 2, 760-765.

[61] D. Nauck, F. Klawonn & R. Kruse, Foundations of Neuro-Fuzzy Systems
(New York: Wiley, 1997).

[62] J.G. Nicholls, A.R. Martin & B.G. Wallace, From Neuron to Brain: A
Cellular and Molecular Approach to the Function of the Nervous System, 3rd Edn.
(Sunderland, MA: Sinauer Associates, 1992).

[63] A. Nikov & S. Stoeva, Quick fuzzy backpropagation algorithm. Neural
Netw., 14 (2001), 231-244.

[64] H. Nomura, I. Hayashi & N. Wakami, A learning method of fuzzy inference
rules by descent method. In: Proc. IEEE Int. Conf. Fuzzy Syst., San Diego,
CA, 1992, 203-210.

[65] H. Nunez, C. Angulo & A. Catala, Rule extraction from support vector
machines. In: Proc. Eur. Symp. Artif. Neural Netw., 2002, 107-112.

[66] H. Nunez, C. Angulo & A. Catala, Rule-based learning systems for support
vector machines. Neural Process. Lett., 24 (2006), 1-18.

[67] C.W. Omlin & C.L. Giles, Extraction of rules from discrete-time recurrent
neural networks. Neural Netw., 9 (1996), 41-52.

[68] C.W. Omlin, K.K. Thornber & C.L. Giles, Fuzzy finite-state automata
can be deterministically encoded into recurrent neural networks. IEEE Trans.
Fuzzy Syst., 6 (1998), 76-89.










[69] W. Pedrycz & A.F. Rocha, Fuzzy-set based models of neurons and
knowledge-based networks. IEEE Trans. Fuzzy Syst., 1:4 (1993), 254-266.

[70] W. Pedrycz, M. Reformat & K. Li, OR/AND neurons and the development
of interpretable logic models. IEEE Trans. Neural Netw., 17:3 (2006), 636-658.

[71] H.-J. Rong, N. Sundararajan, G.-B. Huang & P. Saratchandran, Sequential
adaptive fuzzy inference system (SAFIS) for nonlinear system identification
and prediction. Fuzzy Sets Syst., 157 (2006), 1260-1275.

[72] A.M.S. Roque, C. Mate, J. Arroyo & A. Sarabia, iMLP: applying
multi-layer perceptrons to interval-valued data. Neural Process. Lett., 25
(2007), 157-169.

[73] P.K. Simpson, Fuzzy min-max neural networks, Part I: classification. IEEE
Trans. Neural Netw., 3 (1992), 776-786.

[74] P.K. Simpson, Fuzzy min-max neural networks, Part II: clustering. IEEE
Trans. Fuzzy Syst., 1:1 (1993), 32-45.

[75] N.A. Sisman-Yilmaz, F.N. Alpaslan & L. Jain, ANFIS-unfolded-in-time for
multivariate time series forecasting. Neurocomput., 61 (2004), 139-168.

[76] E. Soria-Olivas, J.D. Martin-Guerrero, G. Camps-Valls, A.J. Serrano-Lopez,
J. Calpe-Maravilla & L. Gomez-Chova, A low-complexity fuzzy activation
function for artificial neural networks. IEEE Trans. Neural Netw., 14:6 (2003),
1576-1579.

[77] S. Stoeva & A. Nikov, A fuzzy backpropagation algorithm. Fuzzy Sets Syst.,
112 (2000), 27-39.

[78] C.T. Sun, Rule-base structure identification in an adaptive-network-based
inference system. IEEE Trans. Fuzzy Syst., 2:1 (1994), 64-79.

[79] P. Sussner & M.E. Valle, Implicative fuzzy associative memories. IEEE
Trans. Fuzzy Syst., 14:6 (2006), 793-807.

[80] R. Thawonmas & S. Abe, Function approximation based on fuzzy rules
extracted from partitioned numerical data. IEEE Trans. Syst. Man Cybern. B,
29:4 (1999), 525-534.

[81] D. Tsujinishi & S. Abe, Fuzzy least squares support vector machines for
multiclass problems. Neural Netw., 16 (2003), 785-792.

[82] K. Uehara & M. Fujise, Fuzzy inference based on families of α-level sets.
IEEE Trans. Fuzzy Syst., 1:2 (1993), 111-124.

[83] A. Ultsch, R. Mantyk & G. Halmans, Connectionist knowledge acquisition
tool: CONKAT. In: D.J. Hand, Ed., Artificial Intelligence Frontiers in
Statistics: AI and Statistics III (London: Chapman & Hall, 1993), 256-263.

[84] P. Vuorimaa, Fuzzy self-organizing map. Fuzzy Sets Syst., 66:2 (1994),
223-231.

[85] L.X. Wang & J.M. Mendel, Generating fuzzy rules by learning from
examples. IEEE Trans. Syst. Man Cybern., 22:6 (1992), 1414-1427.

[86] L.X. Wang & J.M. Mendel, Fuzzy basis functions, universal approximation,
and orthogonal least-squares learning. IEEE Trans. Neural Netw., 3:5 (1992),
807-814.
















[87] L.X. Wang, Analysis and design of hierarchical fuzzy systems. IEEE Trans.
Fuzzy Syst., 7:5 (1999), 617-624.

[88] L.X. Wang & C. Wei, Approximation accuracy of some neuro-fuzzy
approaches. IEEE Trans. Fuzzy Syst., 8:4 (2000), 470-478.

[89] S. Wu & M.J. Er, Dynamic fuzzy neural networks: A novel approach to
function approximation. IEEE Trans. Syst. Man Cybern. B, 30:2 (2000), 358-364.

[90] H.-S. Yan & D. Xu, An approach to estimating product design time based
on fuzzy ν-support vector machine. IEEE Trans. Neural Netw., 18:3 (2007),
721-731.

[91] W. Yang, X. Yan, L. Zhang & C. Sun, Feature extraction based on fuzzy
2DLDA. Neurocomput., 73 (2010), 1556-1561.

[92] X. Yang, G. Zhang, J. Lu & J. Ma, A kernel fuzzy c-means clustering-based
fuzzy support vector machine algorithm for classification problems with
outliers or noises. IEEE Trans. Fuzzy Syst., 19:1 (2011), 105-115.

[93] D. Zhang, X.L. Bai & K.Y. Cai, Extended neuro-fuzzy models of multilayer
perceptrons. Fuzzy Sets Syst., 142 (2004), 221-242.

[94] S.-M. Zhou & J.Q. Gan, Constructing L2-SVM-based fuzzy classifiers in
high-dimensional space with automatic model selection and fuzzy rule ranking.
IEEE Trans. Fuzzy Syst., 15:3 (2007), 398-409.










23 Neural circuits and parallel implementation


23.1 Introduction


Hardware technologies for implementing neural networks can be either analog or
digital. Analog hardware is a natural choice for neural networks, although the
design of analog chips requires good theoretical knowledge of transistor physics
as well as experience. A weight in a neural network can be coded by a single
analog element (e.g., a resistor). Very simple rules such as Kirchhoff's laws
can be used to carry out the addition of input signals. As an example, Boltzmann
machines can be easily implemented by amplifying the natural noise present in
analog devices.

Analog implementation results in low-cost parallelism, low power, high sample
rate or bandwidth, and small size. However, connection storage is usually
volatile, and inaccurate circuit parameters affect the computational accuracy.
Analog VLSI circuits are sensitive to device mismatches, circuit layout, ambient
noise, temperature, and parasitic elements; consequently, design automation for
analog circuits is still quite primitive. On the other hand, components in
neural networks do not have to be of high precision or fast switching. The
learning capability of neural networks can compensate for initial device
mismatches and long-term drift of the device characteristics. The analog
standard-cell method is especially suitable for VLSI neural designs [48].

Digital systems, though limited by discretization error, power consumption and
circuit size, are preferred for high accuracy, high repeatability, low noise
sensitivity, good testability, high flexibility, and compatibility with other
types of preprocessing. Digital systems can be designed more easily using
computer-aided design tools. Nevertheless, digital designs are generally slower
in computation than analog ones. The majority of the available digital chips
use CMOS technology.

Many neuro-chips have been designed and built. A major limitation of VLSI
implementation for general-purpose neural networks is the large number of
interconnections and synapses. The number of synapses increases quadratically
with that of neurons, and thus silicon area is mainly occupied by the synaptic
cells and the interconnection channels. A single integrated circuit is planar,
with limited possibility for crossover connections. A possible solution is to
use optical interconnections. Hardware implementations of neural networks are
commonly based on building blocks, and thus exploit the inherent parallelism of
neural networks. For highly localized networks such as cellular networks, the
VLSI implementation is relatively simple.

Field programmable gate arrays (FPGAs) provide an excellent, quick,
general-purpose development platform for digital systems. FPGA implementation
is a cheap alternative to VLSI for research or low production quantities, and
IP (intellectual property) cores are readily available for FPGAs. FPGAs can be
digitally configured (programmed) to implement virtually any digital system,
with accessibility, reprogrammability, and low cost. An FPGA platform supports
development of fast, compact solutions, providing powerful integration of
hardware design with the software-programming paradigm. Specifically, this
integration is made possible by the use of a general-purpose hardware
description language such as VHDL or Verilog, which can target different FPGA
chips or ASICs. Although FPGAs do not achieve the power, clock rate or gate
density of custom chips, they provide a speed-up of several orders of magnitude
compared to software simulation.

A solution for overcoming the problem of limited FPGA density is to imple- 
ment separate parts of the same system by time-multiplexing a single FPGA 
chip through run-time reconfiguration. This technique has been used in BP algo- 
rithms, dividing the algorithm into three sequential stages: forward, backward 
and update stages. When the computations of one stage are completed, the 
FPGA is reconfigured for the next stage [31]. The efficiency of this approach 
depends on the reconfiguration time relative to the computation time. Finite 
precision errors are introduced due to the quantization of both the signals and 
the parameters. 

Another solution for overcoming the same problem is the use of pulse-stream 
arithmetic. The signals are stochastically coded in pulse sequences and therefore 
can be summed and multiplied using simple logic gates. This is a full digital 
implementation using analog circuitry. The pulse width modulation is a hybrid 
pulse stream technique that combines the advantages of both analog and digital 
VLSI implementations. In a pulse-stream based architecture, signals are encoded 
by using pulse amplitude, width, density, frequency or phase. Pulse-mode
architecture has a number of advantages over analog and conventional digital
implementations. For instance, signal multiplication can be realized by using a
very simple digital circuit such as an AND gate, and a nonlinear activation
function can also be easily implemented. An FPGA prototyping implementation of
an on-chip
BP algorithm presented in [42] uses parallel stochastic bit-streams. Pulse-stream 
based architectures are gaining support in hardware design of neural networks 
[36]. 
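
As a rough illustration of why stochastic pulse coding makes multiplication
cheap, the following Python sketch encodes two values in [0, 1] as random
bit-streams and multiplies them with a bitwise AND. It is a software simulation
only; the function names (encode, stochastic_multiply) and the stream length
are assumptions made for this example and are not taken from [42] or [36].

```python
import random


def encode(p, n_bits, rng):
    """Encode a value p in [0, 1] as a random bit-stream whose mean is about p."""
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]


def stochastic_multiply(a, b, n_bits=100_000, seed=0):
    """Estimate a*b by AND-ing two independent stochastic bit-streams."""
    rng = random.Random(seed)
    stream_a = encode(a, n_bits, rng)
    stream_b = encode(b, n_bits, rng)
    # One AND gate per bit pair: P(bit_a AND bit_b) = a * b for independent streams.
    anded = [x & y for x, y in zip(stream_a, stream_b)]
    # The pulse density of the output stream decodes the product.
    return sum(anded) / n_bits


if __name__ == "__main__":
    print(stochastic_multiply(0.3, 0.5))  # close to 0.15, up to sampling noise
```

The accuracy of the estimate improves only with the square root of the stream
length, which is the usual price paid for the very simple arithmetic.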

Two types of parallel general-purpose computers are SIMD (single instruction 
multiple data) and MIMD (multiple instruction multiple data). SIMD consists 
of a number of processors which execute the same instructions but on different 
data, whereas MIMD has a separate program for each processor. Fine-grain 
computers are usually SIMD, while coarse-grain computers tend to be MIMD. 
Systolic arrays take advantage of laying out algorithms in two dimensions.
The name systolic is derived from the analogy between pumping blood through a
heart and feeding data through a systolic array.


Hardware/software codesign 


Restrictive design specifications such as high performance, reduced size, or low
power consumption are difficult to fulfil with a software approach. However,
both the research activity and commercial interest in neural/fuzzy hardware have
been decreasing due to the significant increase in speed of software solutions
based on general-purpose microprocessors or digital signal processors (DSPs).
Software approaches are characterized by their high versatility, whereas
dedicated hardware implementations provide a suitable solution only when extreme
requirements in terms of speed, power consumption or size must be met.

An overview of the existing hardware implementations of neural and fuzzy
systems is given in [73], where limitations, advantages, and bottlenecks of
analog, digital, pulse stream (spiking) and other techniques are discussed. The
overview concludes that hardware/software codesign is a means of exploiting the
best of both the hardware and software techniques, as it allows a fast design
of complex systems with the highest performance-cost ratio. Heterogeneous
hardware/software technologies have emerged as an optimal solution for many
systems. This approach partitions the system into hardware and software parts
so as to exploit the intrinsic advantages of each.

In [26], the ANFIS model is modified for efficient hardware/software
implementation. The resulting piecewise multilinear ANFIS exhibits
approximation capabilities and learning abilities comparable to those of
generic ANFIS. Two different on-chip design approaches are presented: a
high-performance parallel architecture for offline training and a pipelined
architecture suitable for online parameter adaptation. The target device
contains an ARM embedded-processor core and a large FPGA. The processor provides
flexibility and high precision to implement the learning algorithms, while the
FPGA allows the development of high-speed inference architectures for real-time
embedded applications. The internal architecture is shown in Fig. 23.1. The
processor subsystem contains a 32-bit ARM922T hard processor core, a memory
subsystem, external memory interfaces and standard peripherals, while the FPGA
block consists of an APEX 20KE-like architecture with resources for integration.


Figure 23.1 Internal architecture of Altera's Excalibur family used for
implementation of the piecewise multilinear ANFIS.


Topics in digital circuit designs


Multiplication-free architectures are attractive, since digital multiplication
operations in each neuron are very demanding in terms of time or chip area and
create a bottleneck. In binary representation, multiplication between two
integers can be substituted by a shift, if one of the integers is a power of
two.

The family of coordinate rotation digital computer (CORDIC) algorithms
exploits the geometric properties of two-dimensional and three-dimensional
vector rotations for the fast computation of transcendental functions through
additions and shifts [2]. One version of these algorithms can be used for
computing the exponential function e^x by performing two-dimensional rotations
in a hyperbolic coordinate system and making use of the relation
e^x = sinh(x) + cosh(x).
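
The hyperbolic rotation idea can be sketched in Python as follows. This is a
minimal floating-point illustration, not the algorithm of [2] as implemented in
hardware: in a real design the multiplications by 2^(-i) are shifts, the
atanh(2^(-i)) values come from a small table, and fixed-point arithmetic is
used. The function name cordic_exp, the iteration count, and the repeated
iteration indices 4, 13, 40 (the usual convergence fix for hyperbolic CORDIC)
are assumptions of this sketch. It is valid only for |x| below about 1.118.

```python
import math


def cordic_exp(x, n_iter=24):
    """Approximate e**x for |x| < ~1.118 with hyperbolic CORDIC rotations."""
    # Iteration indices 1, 2, 3, ...; indices 4, 13, 40, ... are executed twice.
    indices = []
    i, repeat = 1, 4
    while len(indices) < n_iter:
        indices.append(i)
        if i == repeat:
            indices.append(i)
            repeat = 3 * repeat + 1
        i += 1

    # Gain of the hyperbolic pseudo-rotations, compensated in the start value.
    gain = 1.0
    for i in indices:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))

    c, s, z = 1.0 / gain, 0.0, x          # c -> cosh(x), s -> sinh(x), z -> 0
    for i in indices:
        d = 1.0 if z >= 0.0 else -1.0     # rotation direction
        t = 2.0 ** (-i)                   # a right shift by i bits in hardware
        c, s = c + d * s * t, s + d * c * t
        z -= d * math.atanh(t)            # table look-up in hardware
    return c + s                          # e**x = cosh(x) + sinh(x)


if __name__ == "__main__":
    print(cordic_exp(0.5), math.exp(0.5))
```

Arguments outside the convergence range would need a range-reduction step,
which is omitted here.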


Look-up tables 

When the logistic or hyperbolic tangent function is used in digital designs, an
exponential function needs to be calculated. The value of an exponential
function is usually computed by using a Taylor-series expansion, which requires
many floating-point operations. With a piecewise-linear approximation of the
sigmoidal function and its derivative, two look-up tables are used to store the
input-output associations. The output of a unit can then be approximated by
linear interpolation of the points in a table. Since the activation function
usually has output in the interval (0, 1) or (-1, 1), it is possible to adopt a
fixed-point representation. Most neural chips integrate look-up tables and
fixed-point representation of the activation function in order to simplify the
logic design and increase the processing speed.

This method needs an external memory to store the look-up table, and thus
simultaneous evaluation of the activation function for multiple neurons is not
possible. In practical implementations, all the neurons are typically assumed
to have the same activation function. Depending on the type of processor, the
calculation of nonlinear activation functions such as the logistic or hyperbolic
tangent function may consume considerable time. Piecewise-linear approximation
to the sigmoidal functions along with look-up tables works very well [36].
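
A look-up table with linear interpolation can be sketched in a few lines of
Python. The table range [-8, 8], the 64 segments, and the names below are
assumptions chosen for illustration; a hardware design would store the table
entries in fixed point and pick the size to match its accuracy budget.

```python
import math

X_MIN, X_MAX, N_SEG = -8.0, 8.0, 64   # assumed table range and number of segments
STEP = (X_MAX - X_MIN) / N_SEG
# Table of logistic values at the N_SEG + 1 breakpoints.
TABLE = [1.0 / (1.0 + math.exp(-(X_MIN + i * STEP))) for i in range(N_SEG + 1)]


def logistic_lut(x):
    """Piecewise-linear approximation of the logistic function from TABLE."""
    if x <= X_MIN:
        return TABLE[0]                # saturate below the table range
    if x >= X_MAX:
        return TABLE[-1]               # saturate above the table range
    pos = (x - X_MIN) / STEP
    i = min(int(pos), N_SEG - 1)       # index of the segment containing x
    frac = pos - i                     # position inside the segment, in [0, 1]
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])


if __name__ == "__main__":
    for x in (-3.0, 0.1, 2.5):
        print(x, logistic_lut(x), 1.0 / (1.0 + math.exp(-x)))
```

A second table built in the same way from the derivative values would serve the
backward pass.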

Quantization effects

In embedded systems design, the digital hardware constraints can be quite severe
and the use of fixed-point arithmetic requires both a careful algorithm
implementation and a thorough analysis of quantization effects [6]. Typical
examples are the devices for sensor networks [24], where the minimization of
both area and power consumption is a strategic issue for any successful
application. One possible solution is to avoid hardware multipliers, which are
quite demanding in terms of resources, especially when compared to adders. In
most applications, a representation precision with 16-bit coding for weights
and 8-bit coding for the outputs is sufficient for the convergence of a
learning algorithm [10].

There are some cases for which the quantization effect is beneficial and the
error rate becomes even lower with respect to the floating-point case. This is
suggested by the large variation of the results for small k and is an already
reported effect [5], which can be partially explained by the fact that the
precision reduction acts as a pruning of the less important parameters of an
SVM. There are two byproducts of the quantization process. The first one is the
increase of the sparsity of an SVM or a neural network, because some parameters
are negligible with respect to the least significant bit and are therefore
rounded to zero. This increases the generalization ability. The second
consequence of the quantization process is the reduction of the number of bits
required to describe the network, which improves the generalization ability of
a learning machine in the MDL framework.
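
The sparsity effect is easy to reproduce on a toy example. The Python sketch
below rounds a handful of made-up weights to a signed fixed-point grid; the
quantize function, the weight values, and the choice of 8 bits over the range
[-1, 1) are all assumptions for illustration and are not taken from [5] or [6].

```python
def quantize(w, n_bits, w_max=1.0):
    """Round w to a signed fixed-point grid of n_bits (sign included) on [-w_max, w_max)."""
    levels = 2 ** (n_bits - 1)
    lsb = w_max / levels                      # value of the least significant bit
    q = round(w / lsb)                        # nearest integer code
    q = max(-levels, min(levels - 1, q))      # saturate to the representable range
    return q * lsb


# Hypothetical weights; two of them are much smaller than the 8-bit LSB (1/128).
weights = [0.75, -0.002, 0.0301, -0.4, 0.0009, 0.12]
quantized = [quantize(w, 8) for w in weights]
n_zero = sum(1 for v in quantized if v == 0.0)

if __name__ == "__main__":
    print(quantized)
    print("weights rounded to zero:", n_zero)
```

Weights whose magnitude is below half the least significant bit vanish, so the
quantized model has fewer nonzero parameters than the original.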


Circuits for neural-network models 


The sigmoidal function can be generated by utilizing the current-voltage
characteristics of a differential pair operating in the subthreshold regime (see
Fig. 23.2a). By cascading two differential pairs, the basic Gaussian circuit has
been developed, which has the problem of asymmetry. By attaching one more
differential pair, we obtain the Gilbert Gaussian circuit. Figure 23.2b can be
regarded as a circuit made by connecting the input terminals of a Gilbert
multiplier. It is often used as a squaring circuit [89]. Because of the
nonlinearity, its use as a squarer suffers from a narrow input range. However,
by focusing on the single output of the differential current, it becomes a
symmetric Gaussian circuit having a wide input range, yielding a Gilbert
Gaussian circuit.
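
For reference, the sigmoidal characteristic of a subthreshold differential pair
can be written out explicitly. The expression below is the standard textbook
form, given here as background rather than taken from this chapter; the symbols
are assumptions of this note: I_b is the bias (tail) current, V_1 and V_2 are
the two gate voltages, \kappa is the gate coupling coefficient, and U_T is the
thermal voltage.

```latex
I_1 = I_b \, \frac{e^{\kappa V_1 / U_T}}{e^{\kappa V_1 / U_T} + e^{\kappa V_2 / U_T}},
\qquad
I_1 - I_2 = I_b \tanh\!\left( \frac{\kappa (V_1 - V_2)}{2 U_T} \right)
```

The single-ended current I_1 is thus a logistic function of the input voltage
difference, and the differential current is a hyperbolic tangent.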


Figure 23.2 Circuits for generating activation functions. (a) Sigmoidal circuit.
(b) Gilbert Gaussian circuit. ©IEEE, 2010 [40].


Circuits for MLPs


The computational power of the MLP using integer weights in a very restricted
range has been analyzed in [28]. If the weights are not restricted to a proper
range and precision, a solution may not exist. For classification problems, an
existence result is derived for calculating a weight range that is able to
guarantee the existence of a solution as a function of the minimum distance
between patterns of different classes.

A parallel VLSI implementation of MLP with BP conducted in [62] uses only
shift-and-add and rounding operations instead of multiplications. In the forward
phase, weights are restricted to powers-of-two, and the activation function is
computed through a look-up table. This avoids the multiplication operation. In
the backward phase, the derivative of the activation function is computed
through a look-up table, some internal terms are rounded to the nearest
powers-of-two, and external terms like η are selected as powers-of-two terms.
Decomposition of binary integers into power-of-two terms can be accomplished
very quickly and with a limited amount of circuitry [53]. The gain with respect
to multiplication is more than one order of magnitude both in speed and in chip
area, and thus the overhead due to the decomposition operation is negligible.
The rounding operations introduce randomness that helps BP to escape from local
minima. This randomness also helps to improve the generalization performance of
the network.
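
The replacement of a multiplier by a shifter can be sketched as follows. This
Python fragment is an illustration under stated assumptions, not the design of
[62]: the weight is rounded to the nearest power of two in the logarithmic
domain (one of several possible rounding rules), activations are integers, and
the names nearest_power_of_two and shift_multiply are hypothetical.

```python
import math


def nearest_power_of_two(w):
    """Return (sign, exponent) so that sign * 2**exponent approximates a nonzero w.

    Rounding is done on log2|w|, i.e. in the logarithmic domain (an assumption).
    """
    if w == 0:
        raise ValueError("zero has no power-of-two representation")
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))
    return sign, exponent


def shift_multiply(x, sign, exponent):
    """Multiply the integer x by sign * 2**exponent using only a shift.

    A negative exponent becomes a right shift, which discards low-order bits.
    """
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return sign * shifted


if __name__ == "__main__":
    sign, exponent = nearest_power_of_two(0.23)   # 0.23 is approximated by 2**-2
    print(sign, exponent, shift_multiply(64, sign, exponent))
```

A weighted sum then needs only shifts and additions, one shift per synapse.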

MLP with BP is implemented as a full digital system using pulse-mode neurons
in [36]. A piecewise-linear activation function is used. The BP algorithm is
simplified to make the hardware implementation easier. The derivative of the
activation function is generated by a pulse differentiator. A random pulse
sequence is injected into the pulse differentiator output to improve the
learning capability.

The circuit complexity of a sigmoidal MLP can be examined in the framework
of classical Boolean and threshold gate circuit complexity by converting a
sigmoidal MLP into an equivalent threshold gate circuit [13]. Sigmoidal MLPs can
be implemented in polynomial size Boolean circuits with a small constant fan-in
at the expense of an increase in the number of layers by a logarithmic factor. A
survey of the circuit complexity for networks of binary perceptrons is available
in [14], where nine different constructive solutions for the addition of two
binary numbers using the threshold logic gate are discussed and compared.

A method for parallel hardware implementation of neural networks using
digital techniques is presented in [68]. Signals are represented using uniformly
weighted single-bit streams, which can be obtained by using the front-end of
a sigma-delta modulator. This single-bit representation offers significant
advantages over multi-bit representations, since it mitigates the fan-in and
fan-out issues which are typical of distributed systems. To process these bit
streams using neural network concepts, functional elements which perform
summing, scaling and squashing have been implemented. These elements are
modular and have been designed to be easily interconnected. Using these
functional elements, an MLP can be easily constructed.
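
The single-bit representation can be illustrated with a first-order sigma-delta
loop in software. The Python sketch below is a simplified model written for this
illustration (constant input, ideal integrator, hypothetical function names); it
is not the front-end circuit of [68]. Every bit carries the same weight, and the
mean of the stream recovers the encoded value.

```python
def sigma_delta_encode(x, n_bits):
    """Encode a constant x in [-1, 1] as a stream of +1/-1 bits (first-order loop)."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("input must lie in [-1, 1]")
    acc = 0.0                              # integrator state
    bits = []
    for _ in range(n_bits):
        bit = 1 if acc >= 0.0 else -1      # one-bit quantizer
        acc += x - bit                     # integrate the input minus the fed-back bit
        bits.append(bit)
    return bits


def decode(bits):
    """Recover the encoded value as the average of the uniformly weighted bits."""
    return sum(bits) / len(bits)


if __name__ == "__main__":
    stream = sigma_delta_encode(0.3, 1000)
    print(decode(stream))                  # close to 0.3
```

Because the integrator state stays bounded, the decoding error shrinks in
proportion to the reciprocal of the stream length.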

A CORDIC-like algorithm is proposed in [61] for computing the feedforward 
phase of an MLP in fixed-point arithmetic, using only shift and add operations 
and avoiding multiplications. Digital BP learning [64] is implemented for three- 
layered neural networks with nondifferentiable binary units. A neural network 
using digital BP learning is fast and easy to implement in hardware. In [11], a 
real-time reconfigurable perceptron circuit element is presented using subthresh- 
old operation. The circuit performs competitively with standard static CMOS 
implementation. 


Circuits for RBF networks 


Most hardware implementations for the RBF network are developed for the 
Gaussian RBF network. The properties of the MOS transistor are desirable for 
analog designs of the Gaussian RBF network. In the subthreshold or weak- 
inversion region, the drain current of the MOS transistor has an exponential 
dependence on the gate bias and dissipates very low power. This exponential 
characteristic of the MOS devices is usually exploited for designing the Gaussian 
function [89, 18, 54]. In [18], a compact analog Gaussian cell is designed whose 
core takes only eight transistors and can be supplied with a high number of input 
pairs. On the other hand, the MOS transistor has a square-law dependence on 
the bias voltages in its strong-inversion or saturation region, based on which a 
compact programmable analog Gaussian synapse cell has been designed in [20]. 

A similarity measure, typically the Euclidean distance measure, is essential in
many neural-network models such as the clustering networks and the RBF network.
The circuits for the Euclidean distance measure are usually based on the
square-law property of the strong-inversion region [21, 54, 63, 95]. A
generalized measure of similarity between two voltage inputs can be implemented
by using the Gaussian-type (bump) circuit or the bump-antibump circuit. These
circuits are based on the concept of the current correlator developed for
weak-inversion operation [27]. Based on the current correlator, an analog
Gaussian/square function computation circuit is given in [54]. This circuit
exhibits independent programmability for the center, width, and peak amplitude 
of the dc transfer curve. When operating in the strong-inversion region, it cal- 
culates the squared difference, whereas in the weak-inversion region it realizes 
a Gaussian-like function. In [21], circuits for calculating Euclidean distance and 
computing programmable Gaussian units with tunable center and variance are 
also designed based on the square-law of the strong-inversion region. A pulsed 
VLSI RBF network chip is fabricated in [63] where a collection of pulse-width- 
modulation analog circuits are combined on a single RBF network chip. The dis- 
tance metric is based on the square-law property of the strong-inversion region, 
and a Gaussian-like RBF is produced using two MOS transistors. 

In an analog VLSI circuit implementation of conic section function network 
neurons [95], the circuit computes both the weighted sum for the MLP and the 
Euclidean distance for the RBF. The two propagation rules are then aggregated 
as the design of synapse and neuron of the conic-section function network. A 
hybrid VLSI/digital design of the RBF network integrates a custom analog VLSI 
circuit and a commercially available digital signal processor [89]. 


Circuits for clustering 


Many WTA models can be achieved based on the continuous-time Hopfield net- 
work topology [59, 83], or based on the cellular network model with linear circuit 
complexity [79, 3]. There are also some circuits for realizing the WTA function 
such as a series of compact CMOS integrated circuits [47]. 

k-WTA networks are usually based on the continuous-time Hopfield network 
[59, 94]. The k-WTA circuit devised in [47] has infinite resolution, and is imple- 
mented using the Hopfield network based on the penalty method. In [87], a 
k-WTA circuit with O(n) interconnect complexity extends the WTA circuits 
given in [47], where n is the number of neurons. k-WTA is formulated as a 
mathematical programming problem solved by direct analog implementation of 
the Lagrange multiplier method. The circuit has merits of real-time responses 
and short wire length, although it has a finite resolution. A discrete-time
mathematical model of a k-WTA neural circuit that can quickly identify the k
winning nodes from n neurons is given and analyzed in [86]. For n competitors,
the circuit
is composed of n feedforward and one feedback hard-limiting neurons that are 
used to determine the dynamic shift of input signals. In a k-WTA network based 
on a one-neuron recurrent network [56], the k-WTA operation is first converted 
equivalently into a linear programming problem. Finite time convergence of the 
network is proved using the Lyapunov method. 
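
The k-WTA operation itself is simple to state in software: exactly the k
largest inputs produce output 1. The Python sketch below finds a common
threshold shift by bisection and applies hard-limiting units to the shifted
inputs. It is only loosely inspired by the shift-based description above and is
not the circuit or the model of [86], [87] or [56]; the function name, the
bisection search, and the assumption of distinct inputs are all choices made
for this illustration.

```python
def k_wta(u, k, max_iter=64):
    """Return a 0/1 list marking the k largest entries of u (inputs assumed distinct)."""
    if not 0 < k < len(u):
        raise ValueError("k must satisfy 0 < k < len(u)")
    lo, hi = min(u) - 1.0, max(u)          # bracket for the common threshold shift
    for _ in range(max_iter):
        theta = 0.5 * (lo + hi)
        out = [1 if ui > theta else 0 for ui in u]   # hard-limiting units
        n_on = sum(out)
        if n_on == k:
            return out
        if n_on > k:
            lo = theta                     # too many winners: raise the threshold
        else:
            hi = theta                     # too few winners: lower the threshold
    raise RuntimeError("no threshold found; inputs may contain ties")


if __name__ == "__main__":
    print(k_wta([0.3, 0.9, 0.1, 0.7, 0.5], 2))
```

With tied inputs no threshold may separate exactly k winners, which mirrors the
finite-resolution issue mentioned above.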

In [74], improved neural gas and its analog VLSI subcircuitry are developed 
based on partial sorting. The VLSI architecture includes two chips, one for the 
Euclidean distance computation and the other for the programmable sorting of 
code vectors. The latter is based on the WTA structure [47]. The approach is 
empirically shown to reduce the training time by up to two orders of magnitude,
without reducing the performance quality. A fully analog integrated circuit of
SOM has been designed in [60].


Circuits for SVMs 


SVM learning can be formulated as a dynamic problem, which can then be solved 
using a two-layer recurrent network. The neural network is suitable for analog
hardware implementation [4, 84]. A neural network training method for SVMs
is proposed based on the theory of primal-dual neural networks [84], which is able 
to accomplish the training and testing of SVMs on large data sets in real time 
by using VLSI hardware implementation. In [7], a digital architecture for SVM 
learning is modeled as a dynamical system, and is implemented on an FPGA. 

A one-layer recurrent network for SVM learning is presented in [91] for pattern 
classification and regression. The proposed network is guaranteed to obtain the 
optimal solution of SVM and SVR. Compared with the existing two-layer neural 
network for SVM classification, the proposed network has a low complexity for 
implementation. Moreover, it can converge exponentially to the optimal solution 
of SVM learning. The realization for SVM learning is extended to more general 
optimization problems in [92]. It consists of a one-layer recurrent network whose 
steady-state solution satisfies the KKT conditions of the dual QP problem. An 
analog neural network for SVM learning is proposed in [69], based on a partially 
dual formulation of the QP problem. The performance is substantially equivalent 
to that of [91], in terms of both the settling time and the steady-state solutions. 

A CORDIC-like algorithm for computing the feedforward phase of an SVM in 
fixed-point arithmetic is proposed in [8], using only shift and add operations and 
avoiding multiplications. This result is obtained thanks to a hardware-friendly 
kernel, 


$$k(x_i, x_j) = 2^{-\gamma \|x_i - x_j\|_1}, \qquad (23.1)$$

where the hyperparameter is an integer power of two, i.e. $\gamma = 2^{\pm p}$ for integer $p$, 
and $\|\cdot\|_1$ is the $L_1$ norm. This kernel is proved to be an admissible Mercer’s 
kernel. It greatly simplifies the SVM feedforward phase computation. 
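
To make the role of the power-of-two hyperparameter concrete, the following minimal sketch evaluates a kernel of the form 2^(-gamma * ||x - y||_1) and the resulting SVM feedforward sum. It uses ordinary floating-point arithmetic for readability and is not the shift-and-add algorithm of [8]; the function names and the argument p (the exponent of gamma) are illustrative assumptions.

```python
def l1_distance(x, y):
    """L1 (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hardware_friendly_kernel(x, y, p):
    """Evaluate k(x, y) = 2^(-gamma * ||x - y||_1) with gamma = 2^p.

    p is an integer (positive or negative), so gamma is a power of two and
    multiplying by gamma amounts to a bit shift in fixed-point hardware.
    """
    gamma = 2.0 ** p
    return 2.0 ** (-gamma * l1_distance(x, y))


def svm_feedforward(x, support_vectors, coefficients, bias, p):
    """SVM output: bias + sum_j c_j * k(x, s_j), with c_j = alpha_j * y_j."""
    return bias + sum(
        c * hardware_friendly_kernel(x, s, p)
        for c, s in zip(coefficients, support_vectors)
    )
```

Because both the base of the exponential and gamma are powers of two, the exponent is obtained from the L1 distance by a shift, which is what makes a multiplier-free evaluation plausible.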

SVM with integer parameters [9] is based on a branch-and-bound procedure, 
derived from modern mixed integer QP solvers, and is useful for implement- 
ing the feedforward phase of SVM in fixed-point arithmetic. This allows the 
implementation of the SVM algorithm on resource-limited hardware for building 
sensor networks, where floating-point units are rarely available. 

An analog circuit architecture of Gaussian-kernel SVMs having on-chip train- 
ing capability has been developed in [40]. It has a scalable array processor config- 
uration and the circuit size increases only in proportion to the number of learning 
samples. The learning function is realized by attaching a small additional cir- 
cuitry to the SVM hardware composed of an array of Gaussian circuits. Although 
the system is inherently analog, the input and output signals including training 
results are all available in digital format. Therefore, the learned parameters are 
easily stored and reused after training sessions. 


Circuits of other models 


An analog implementation of the Hodgkin-Huxley model with 30 adjustable parameters 
is described in [72], which requires 4 mm² of area for a single neuron. A circuit 
for generating spiking activity is designed in [77]. The circuit integrates charge 
on a capacitor such that when the voltage on the capacitor reaches a certain 
threshold, two consecutive feedback cycles generate a voltage spike and then 
bring the capacitor back to its resting voltage. 
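
The charge-integration behavior described above can be mimicked in discrete time. The sketch below is a simplified integrate-and-fire model under stated assumptions (constant input current, ideal capacitor, instantaneous reset); it does not model the two feedback cycles of the circuit in [77], and all parameter names are illustrative.

```python
def integrate_and_fire(current, capacitance, threshold, v_rest, dt, steps):
    """Simulate a capacitor that integrates a constant current.

    Returns the indices of the time steps at which a spike is emitted.
    """
    v = v_rest
    spike_steps = []
    for k in range(steps):
        # Charge integration: dV = I * dt / C
        v += current * dt / capacitance
        if v >= threshold:
            # Threshold reached: emit a spike and return to the resting voltage
            spike_steps.append(k)
            v = v_rest
    return spike_steps
```

With a unit current, unit capacitance, a threshold of 1.0 and dt = 0.25, the voltage reaches the threshold every fourth step, so the spike rate grows in proportion to the input current.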

CAVIAR [80] is a massively parallel hardware implementation of a spike-based 
sensing-processing-learning-actuating system inspired by the physiology of the 
nervous system. It uses the asynchronous address-event representation commu- 
nication framework. It performs 12 giga synaptic operations per second, and 
achieves millisecond object recognition and tracking latencies. 

A signal processing circuit for a continuous-time recurrent network is implemented 
in [17] using subthreshold analog VLSI in a mixed-mode (current and 
voltage) approach, where state variables are represented by voltages while neural 
signals are conveyed as currents. The use of currents allows the accuracy of 
the neural signals to be maintained over long distances, making this architecture 
relatively robust and scalable. 

In [37], a layer-multiplexing technique is used to implement multilayer feedforward 
networks on a Xilinx FPGA. The suggested layer multiplexing involves 
implementing only the layer having the largest number of neurons. A separate 
control block is designed, which appropriately selects the neurons from this layer 
to emulate the behavior of any other layer and assigns the appropriate inputs, 
weights, biases and excitation function for every neuron of the layer that is cur- 
rently being emulated in parallel. Each single neuron is implemented as a look-up 
table. 
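
The idea of layer multiplexing can be illustrated in software. In the sketch below, a single bank of neurons sized for the widest layer is reused for every layer, and the loop plays the role of the control block. This is only an analogy under stated assumptions: neurons are computed arithmetically rather than by look-up tables as in [37], and the function names are hypothetical.

```python
def relu(z):
    """Example excitation function."""
    return z if z > 0.0 else 0.0


def neuron_bank(inputs, weights, biases, activation):
    """The single physical layer: every configured neuron is evaluated."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def layer_multiplexed_forward(x, layers):
    """Run a multilayer network on one reusable bank of neurons.

    layers is a list of (weights, biases, activation) tuples. The bank is
    sized for the widest layer; the loop acts as the control block that
    reloads inputs, weights, biases and activation for each emulated layer.
    """
    bank_width = max(len(biases) for _, biases, _ in layers)
    signal = list(x)
    for weights, biases, activation in layers:
        # Narrower layers use only a subset of the physical bank
        assert len(biases) <= bank_width
        signal = neuron_bank(signal, weights, biases, activation)
    return signal
```

The hardware cost is thus set by the widest layer, while the latency grows with the number of layers that are emulated in sequence.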

In [55], a hardware architecture for GHA operates the principal component 
computation and weight vector updating of GHA in parallel. An FPGA imple- 
mentation of ICA algorithm for BSS and adaptive noise canceling is proposed 
in [41]. In [81], FastICA is implemented on an FPGA. A survey using VLSI 
approaches to ICA implementations has been described in [29]. There are some 
examples of hardware implementations of neural SVD algorithms [22], [57]. 

The mean-field annealing algorithm can be simulated by RC circuits, coupled 
with the local nature of the Boltzmann machine, which makes the mean-field- 
theory machine suitable for massively parallel VLSI implementation [49, 78]. 

A restricted Boltzmann machine can be mapped to a high-performance hardware 
architecture on FPGA platforms [51]. A method is presented to partition 
large restricted Boltzmann machines into smaller congruent components, allowing 
the distribution of one restricted Boltzmann machine across multiple FPGA 
resources. The framework is tested on a platform of four Xilinx Virtex II Pro 
XC2VP70 FPGAs running at 100 MHz through a variety of different configurations. 
The maximum performance was obtained by instantiating a restricted 
Boltzmann machine of 256 x 256 nodes distributed across four FPGAs, which 
resulted in a computational speed of 3.13 billion connection-updates-per-second 
and a speedup of 145-fold over an optimized C program running on a 2.8-GHz 
Intel processor. 


Fuzzy neural circuits 


Generally, fuzzy systems can be easily implemented in digital form, which can 
be either general-purpose microcontrollers running fuzzy inference and defuzzifi- 
cation programs, or dedicated fuzzy coprocessors, or RISC processors with spe- 
cialized fuzzy support, or fuzzy ASICs. 

Fuzzy coprocessors work in conjunction with a host processor. They are 
general-purpose hardware, and thus have lower performance than custom 
fuzzy hardware. There are many commercially available fuzzy coprocessors 
[75]. RISC processors with specialized fuzzy support are also available [23, 75]. A 
fuzzy-specific extension to the instruction set is defined and implemented using 
hardware/software codesign techniques. The fuzzy-specific instructions signifi- 
cantly speed up fuzzy computation with no increase in the processor cycle time 
and with only a minor increase in the chip area. 

A common approach to general-purpose fuzzy hardware is to use a software 
design tool to generate the program code for a target microcontroller. Exam- 
ples include the Motorola-Aptronix fuzzy inference development language and 
Togai InfraLogic’s MicroFPL system [38]. Compared to dedicated fuzzy proces- 
sors and ASICs, this approach leads to rapid design and testing at the cost of 
low performance. 

The tool TROUT [38] automates fuzzy neural ASIC design. It produces a spec- 
ification for small, customized, application-specific circuits called smart parts. 
A smart part is a dedicated compact-size circuit customized to a single func- 
tion. A designer can package a smart part in a variety of ways. The model 
library of TROUT includes fuzzy or neural-network models for implementation 
as circuits. To synthesize a circuit, TROUT takes as its input an application 
data set, optionally augmented with user-supplied hints. It delivers, as output, 
technology-independent VHDL code, which describes a circuit implementing a 
specific fuzzy or neural-network model optimized for the input data set. As an 
example, TROUT has been used for the synthesis of the fuzzy min-max classifi- 
cation network. 

On a versatile neurofuzzy platform with a topology strongly influenced by the- 
ories of fuzzy modelling [34], various critical hardware design issues are identified. 
With a hybrid learning scheme involving structural and parametric optimization, 
fuzzy neural networks are well suited to forming the adaptive logic processing 
core of this platform. A resolution of at most 4 or 5 bits is necessary to achieve 
performance on par with that of a continuous representation, with good results 
seen for as little as 2 bits of precision [34]. 

There are also many analog [52, 44] and mixed-signal [12, 15] fuzzy circuits. 
Analog circuits usually operate in current mode and are fabricated using CMOS 
technology, and this leads to the advantages of high speed, small-circuit area, 
high performance and low power dissipation. A design methodology for fuzzy 
ASICs and general-purpose fuzzy processors is given in [44], based on the left- 
right fuzzy implication cells and the left-right fuzzy arithmetic cells. In [12, 15], 
the fabrication of mixed-signal CMOS chips for fuzzy controllers is considered; 
in these circuits, the computing power is provided by the analog part while the 
digital part is used for programmability. 


Graphic processing unit (GPU) implementation 


GPUs were originally designed only to render triangles, lines and points. Graph- 
ics hardware is now used for general-purpose computations for performance- 
critical algorithms that can be efficiently expressed in the streaming nature of 
the GPU architecture. Graphics hardware has become competitive in terms of 
speed and programmability. 

GPUs are specialized stream processors. Stream processors are capable of tak- 
ing large batches of fragments that can be thought of as pixels in image pro- 
cessing applications, and computing similar independent calculations in parallel. 
Each calculation is with respect to a program, often called a kernel, which is an 
operation applied to every fragment in the stream. 

The computational power of GPUs is increasing significantly faster than that of 
central processing units (CPUs) [66]. GPUs have instructions to handle many linear 
algebra operations, such as the dot product, vector and matrix multiplication, 
and computing the determinant of a matrix, given their vector processing archi- 
tecture. GPUs are capable of executing more floating-point operations per second 
(flops) than CPUs do. A 3.0-GHz dual-core Pentium 4 can execute 24.6 Gflops, 
while an NVIDIA GeForceFX 7800 can execute 165 Gflops [66]. The NVIDIA GeForce 
8800 GTX has 128 stream processors, a core clock of 575 MHz and a shader clock of 
1350 MHz, and is capable of over 350 Gflops; its peak bandwidth to the graphics 
memory is 86.4 gigabytes per second. The key concept required for converting a 
CPU program to a GPU program involves the idea that arrays are equivalent to 
textures. Data used in a GPU program are passed from the CPU as a texture. 
Due to the limited floating-point representation of the current GPUs, the accuracy 
of GPU implementations is lower than that of CPU implementations, but 
it is still adequate for a large range of applications. However, the floating-point 
representation of graphics cards has been improved by the latest GPU technology. 

Compute unified device architecture (CUDA) is a general-purpose parallel 
computing architecture developed by NVIDIA. It provides an application programming 
interface to a GPU that enables designers to easily create GPU code. 
This model consists of three major abstractions: a hierarchy of thread groups, 
shared memories and barrier synchronization. Using these CUDA features, devel- 
opers can write C-like functions called kernels. Each kernel invocation is associ- 
ated with an ordered set of thread groups in a grid executing the kernel in parallel 
on the GPU. CUDA supports implementing parallel algorithms with topologi- 
cal processing, running the same code segment on all nodes, which is similar to 
the cellular network architecture. On top of CUDA, a CUBLAS library provides 
basic linear algebra subroutines, especially matrix multiplication. The library is 
self-contained at the API (application programming interface) level. The basic 
approach to using the CUBLAS library is to allocate memory space on the GPU, 
transfer data from the CPU to the GPU memory, call a sequence of 
CUBLAS functions, and transfer the results from the GPU back to the CPU memory. 

There are three MATLAB toolboxes for GPU-based processing on the market: 
GPUmat (free, http://gp-you.org/), Jacket (commercial), and the Parallel 
Computing Toolbox of MATLAB (commercial, MATLAB 2010b). Computing 
using GPUmat usually only involves recasting variables and requires minor 
changes to MATLAB scripts or functions. 

A GPU implementation of the multiplication between the weights and the 
input vectors in each layer in the working phase of an MLP is proposed in [65]. 
A GPU implementation of the entire learning process of an RBF network [16] 
reduces the computational cost by about two orders of magnitude with respect 
to its CPU implementation. FCM has a large degree of inherent algorithmic 
parallelism. Many pattern recognition algorithms can be sped up on a GPU as 
long as the majority of computation at various stages and the components are not 
dependent on each other. A generalized method for offloading fuzzy clustering to 
a GPU (http://cirl.missouri.edu/gpu/) [1] leads to a speed increase of over 
two orders of magnitude for particular clustering configurations and platforms. 
GPUs are efficiently used for solving two fundamental graph problems: finding 
the connected components and finding a spanning tree [82]. In [85], the EKF 
algorithm for training recurrent networks is implemented on a GPU, where 
the most computationally intensive tasks are performed on the GPU. 

Restricted Boltzmann machines can be accelerated using a GPU [71]. The 
implementation was written in CUDA and tested on an NVIDIA GeForce GTX 
280. An implementation of the MPI developed specifically for embedded FPGA 
designs, called time-multiplexed differential data-transfer MPI (TMD-MPI) [76], 
extends the MPI protocol to hardware. The hardware implementation is con- 
trolled entirely with MPI software code, using messages to abstract the hardware 
compute engines as computational processes, called ranks. In addition to ease 
of use, this feature also provides portability and versatility, since each compute 
engine is compartmentalized into message-passing modules that can be inserted 
or removed based on available resources and desired functionality. 


Implementation using systolic algorithms 


A systolic array of processors is a parallel digital hardware system. It is an 
array of data processing units which are connected to a small number of nearest 
neighbour data processing units in a mesh-like topology. Data processing units 
perform a sequence of operations on data that flows between them. 

A number of systolic algorithms are available for matrix-vector multiplication, 
the basic computation involved in the operation of a neural network. The sys- 
tolic architecture uses a locally communicative interconnection structure for dis- 
tributed computing. Systolic arrays can be readily implemented in programmable 
logic devices. 
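
As an illustration of how a systolic algorithm performs matrix-vector multiplication, the following sketch simulates one possible linear array clock tick by clock tick. It assumes that processing element (PE) i stores row i of the matrix locally and that the vector elements are pumped through the array one hop per tick; this is one of several possible designs, and the names are illustrative.

```python
def systolic_matvec(A, x):
    """Compute y = A x on a simulated linear systolic array.

    PE i holds row i of A and an accumulator. The elements of x enter at
    PE 0 and move one PE to the right on every clock tick.
    """
    n, m = len(A), len(x)
    acc = [0] * n          # one accumulator per PE
    pipe = [None] * n      # x value currently held in each PE's register
    for t in range(n + m - 1):
        # Shift: each PE passes its x value to its right-hand neighbour
        pipe = [x[t] if t < m else None] + pipe[:-1]
        for i in range(n):
            if pipe[i] is not None:
                j = t - i  # index of the x element that reached PE i at tick t
                acc[i] += A[i][j] * pipe[i]
    return acc
```

All communication is between nearest neighbours, and the product of an n x m matrix with an m-vector completes in n + m - 1 ticks.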

QR decomposition is a key step in many DSP applications. Divide and square 
root operations in the Givens rotation algorithm can be avoided by using special 
operations such as CORDIC or special number systems such as the logarithmic 
number system. A two-dimensional systolic array QR decomposition is implemented 
[88] on an FPGA using the Givens rotation algorithm. This design uses 
straightforward floating-point divide and square root operations, which makes it 
easier to use within a larger system. The input matrix size can be configured 
at compile time to many different sizes. 
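
The Givens rotation algorithm referred to above can be sketched sequentially as follows. The code uses plain divide and square root operations (via `math.hypot`) and is a software illustration only; it does not reproduce the systolic scheduling of [88]. Each rotation is what a single cell of a systolic array would compute and apply.

```python
import math


def givens_qr(A):
    """QR decomposition by Givens rotations. Returns (Q, R) with A = Q R."""
    m, n = len(A), len(A[0])
    R = [list(map(float, row)) for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the subdiagonal entries of column j from the bottom up
        for i in range(m - 1, j, -1):
            a, b = R[i - 1][j], R[i][j]
            r = math.hypot(a, b)          # square root operation
            if r == 0.0:
                continue
            c, s = a / r, b / r           # divide operations
            # Rotate rows i-1 and i of R so that R[i][j] becomes zero
            for k in range(n):
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of Q
            for k in range(m):
                u, v = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * u + s * v
                Q[k][i] = -s * u + c * v
    return Q, R
```

Replacing the divide and square root by CORDIC iterations would change only the computation of c and s; the row updates stay the same.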

In a unified systolic architecture for implementing neural networks [45], proper 
ordering of the elements of the weight matrix makes it possible to design a 
cascaded dependency graph for consecutive matrix-vector multiplication, which 
requires the directions of data movement at both the input and output of the 
dependency graph to be identical. Using this cascaded dependency graph, the 
computations in both the recall and learning iterations of BP have been mapped 
onto a ring systolic array. The same mapping strategy has been used in [39] for 
mapping the HMM and the recursive BP network onto the ring systolic array. 
The main drawback of these implementations is the presence of spiral (global) 
communication links that damage the local property. 

In [50], a two-dimensional array is used to map the synaptic weights of indi- 
vidual weight layers in a neural network. By placing side by side the arrays 
corresponding to adjacent weight layers, both the recall and learning phases of 
BP can be executed efficiently. However, as the directions of data movement at 
the output and the input of each array are different, this leads to a nonuniform 
design. Again, a particular layout can only implement neural networks having 
identical structures. For neural networks that are structurally different, another 
layout would be necessary. 

MLP with BP learning has been implemented on systolic arrays [58], [32], 
[33]. In [58], dependency graphs are derived for implementing operations in both 
the recall and learning phases of BP. These dependency graphs are mapped onto 
a linear bidirectional systolic array and algorithms have been presented for exe- 
cuting both the recall and learning phases efficiently. In [32], BP is implemented 
online by using a pipelined adaptation, where a systolic array is implemented 
on an FPGA. Parallelism is better exploited because both the forward and backward 
phases can be performed simultaneously. In [33], a pipelined modification 
of online BP likewise allows the forward and backward phases to be performed 
simultaneously. 


Implementation using parallel computers 


Multiple-core computers are becoming the norm, and data sets are becom- 
ing ever larger. Most of the previous approaches have considered the paral- 
lel computer system as a cluster of independent processors, communicating 
through a message-passing scheme such as MPI (Message-Passing Interface, 
http://www.mpi-forum.org). MPI is a library of functions. It allows one to 
easily implement an algorithm in parallel by running multiple CPU processors 
for improving efficiency. MPI provides a straightforward software-hardware inter- 
face. The message-passing paradigm is widely used in high-performance comput- 
ing. 

The SPMD (single program, multiple data) mode, where different processors execute 
the same program on different data, is generally used in MPI for developing parallel programs. Advances 
in technology have resulted in systems where several processing cores have access 
to a single memory space, and such symmetric multiprocessing architectures 
are becoming prevalent. OpenMP (OpenMP Application Program Interface, 
http://www.openmp.org) works effectively on shared memory systems, while 
MPI can be used for message passing between nodes. Most high-performance 
computing systems are now clusters of symmetric multiprocessing nodes. On 
such hybrid systems, a combination of message-passing between symmetric mul- 
tiprocessing nodes and shared memory techniques inside each node could poten- 
tially offer the best parallelization performance from the architecture. A standard 
approach to combining the two schemes involves OpenMP parallelization inside 
each MPI process, while communication between the MPI processes is made only 
outside of the OpenMP regions [70]. 

The IBM RS/6000 SP system (http://www.rs6000.ibm.com/hardware/ 
largescale) is a scalable distributed-memory multiprocessor consisting of up to 
512 processing nodes connected by a high-speed switch. Each processing node is 
a specially packaged RS/6000 workstation CPU with local memory, local disk(s), 
and an interface to the high-performance switch. The SP2 parallel environment 
supports MPI for the development of message-passing applications. 

Network-partitioned parallel methods for the SOM algorithm, written in the 
SPMD programming model, preserve the recursive weight update, and hence produce 
exact agreement with the serial algorithm. Data-partitioned algorithms offer 
the potential for much greater scalability since the parallel granularity is deter- 
mined by the volume of data, which is potentially very large. A data-partitioned 
parallel method for SOM [46] is based on the batch SOM formulation in which 
the neural weights are updated at the end of each pass over the training data. 
The underlying serial algorithm is enhanced to take advantage of the sparseness 
often encountered in these data sets. Performance measurements on an SP2 
parallel computer demonstrate essentially linear speedup for the parallel batch 
SOM. 
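
The reason the batch formulation partitions well over the data can be seen from a small sketch. Each data chunk contributes partial numerator and denominator sums, which are added together before the weights are updated once per pass. The code below is a serial emulation under stated assumptions (a one-dimensional lattice, a Gaussian neighborhood of width sigma, and no exploitation of sparseness); it is not the implementation of [46], and in an MPI program each chunk would reside on its own process and the sums would be combined by a reduction.

```python
import math


def winner(x, weights):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, weights[k])))


def partial_sums(chunk, weights, sigma):
    """Neighborhood-weighted sums over one data chunk (one worker's share)."""
    K, d = len(weights), len(weights[0])
    num = [[0.0] * d for _ in range(K)]
    den = [0.0] * K
    for x in chunk:
        c = winner(x, weights)
        for k in range(K):
            h = math.exp(-((k - c) ** 2) / (2.0 * sigma ** 2))
            den[k] += h
            for t in range(d):
                num[k][t] += h * x[t]
    return num, den


def batch_som_epoch(chunks, weights, sigma):
    """One batch-SOM pass: reduce the partial sums, then update all weights."""
    K, d = len(weights), len(weights[0])
    num = [[0.0] * d for _ in range(K)]
    den = [0.0] * K
    for chunk in chunks:
        pn, pd = partial_sums(chunk, weights, sigma)
        for k in range(K):
            den[k] += pd[k]
            for t in range(d):
                num[k][t] += pn[k][t]
    return [[num[k][t] / den[k] for t in range(d)] if den[k] > 0.0
            else list(weights[k]) for k in range(K)]
```

Since the weights are fixed during a pass, splitting the data into chunks changes only the order of summation, so the partitioned result agrees with the serial one up to rounding.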

A parallel implementation of linear SVM training [90] uses a combination of 
MPI and OpenMP. Using an interior-point method for optimization and a refor- 
mulation that avoids the dense Hessian matrix, the structure of the augmented 
system matrix is exploited to partition data and computations amongst paral- 
lel processors efficiently. The hybrid version performs more efficiently than the 
version using pure MPI. 

A parallel version of DBSCAN algorithm [93] uses the shared-nothing architec- 
ture with multiple computers interconnected through a network. A fundamental 
component of a shared-nothing system is its distributed data structure. The 
dR*-tree, a distributed spatial index structure, is introduced, in which the data 
is spread among multiple computers and the indexes of the data are replicated on 
every computer. Parallel DBSCAN offers nearly linear speedup and has excellent 
scaleup and sizeup behavior. 

Parallel SMO [19] is developed based on MPI. It first partitions the entire 
training data set into smaller subsets and then simultaneously runs multiple 
CPU processors to deal with each of the partitioned data sets. The efficiency of 
parallel SMO decreases with the increase of the number of processors, as there 
is more communication time for using more processors. For this reason, parallel 
SMO is more useful for large size problems. 

Probabilistic inference is examined through parallel computation on real multi- 
processors in [43]. Experiments are performed on a 32-processor Stanford DASH 
multiprocessor, a cache-coherent shared-address-space machine with physically 
distributed main memory. 


Implementation using cloud computing 


MapReduce is a distributed programming paradigm for cloud computing environments, 
introduced at Google [25]. The model is shown in Fig. 23.3. In the first 
phase, the input data are processed by the map function, generating intermedi- 
ate results as the input of the reduce function in the second phase. Users specify 
the computation in terms of a map and a reduce function, and the underly- 
ing runtime system automatically parallelizes the computation across large-scale 
clusters of machines, handles machine failures, and schedules inter-machine communication. 
The user can set the number of map functions to be used in the cloud. 
Map tasks are processed in parallel by the nodes in the cluster without sharing 
data with any other node. After all the map functions have completed their tasks, 
the outputs are transferred to reduce function(s). The reduce function produces 
a (possibly) smaller set of values. 

Figure 23.3 The MapReduce framework: the map function is applied to all input records, while the 
generated intermediate results are aggregated by the reduce function. 


Problems 


The <key, value> pair is the basic data structure in MapReduce, and all 
intermediate outputs are <key, value> pairs. For each input <key, value> 
pair, a map function is invoked once and some intermediate pairs are generated. 
These intermediate pairs are then shuffled by merging, grouping and sorting 
values by keys. A reduce function is invoked once for each shuffled intermediate 
pair and generates output pairs as final results. 
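
The map, shuffle and reduce stages can be sketched in a few lines on a single machine. The example below counts words; it is a toy illustration of the data flow only, with no parallelism or fault tolerance, and the function names are not part of any MapReduce API.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: sum all counts that were grouped under the same key."""
    yield key, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    """Run the map, shuffle and reduce stages sequentially."""
    # Map stage: one invocation per input (key, value) pair
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle stage: group the intermediate values by key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce stage: one invocation per key, in sorted key order
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output
```

For instance, `map_reduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)` groups the intermediate pairs under the keys a, b and c before summing. In a real deployment the map invocations run on different nodes, which is possible because they share no data.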

MapReduce has become an industry standard. Two open-source implemen- 
tations of MapReduce are Hadoop (http://hadoop.apache.org) and CGL- 
MapReduce [30]. Hadoop can be easily deployed on commodity hardware. 
Hadoop stores the intermediate computational results in local disks and then 
informs the appropriate workers to retrieve (pull) them for further process- 
ing. This strategy introduces a considerable communication overhead. Moreover, 
Hadoop’s MapReduce API does not support configuring a map task over multiple 
iterations. CGL-MapReduce utilizes NaradaBrokering, a streaming-based 
content dissemination network, for all the communications. This eliminates the 
overheads associated with communicating via a file system. 

The IBM Parallel Machine Learning Toolbox (PML) is similar to the MapReduce 
model in that it provides APIs for easy parallel implementation. Unlike MapReduce, 
PML implements iterative learning, which requires multiple passes over 
the data. 

AdaBoost.PL and LogitBoost.PL are two parallel boosting algorithms 
implemented in the MapReduce framework [67]. A parallel ELM for regression 
is also implemented in the MapReduce framework [35]. 


23.1 Design a systolic array for multiplication of a vector and a matrix. 


24.1 


Pattern recognition for biometrics 
and bioinformatics 


Biometrics 


Biometrics are the personal or physical characteristics of a person. These biomet- 
ric identities are usually used for identification or verification. Biometric recog- 
nition systems are increasingly being deployed as a more natural, more secure 
and more efficient means than the conventional password-based method for the 
recognition of people. Many biometric verification systems have been developed 
for global security. 

A biometric system may operate in either the verification or identification 
mode. The verification mode authenticates an individual’s identity by comparing 
the individual with his/her own template(s) (Am I who I claim to be?). 
It conducts one-to-one comparison. The identification mode recognizes an indi- 
vidual by searching the entire template database for a match (Who am I?). It 
conducts one-to-many comparisons. 
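
The difference between the two modes can be made concrete with a small sketch. It assumes that each enrolled person is represented by a single feature vector and that a match is declared when the Euclidean distance falls below a threshold; the function names, the template dictionary and the threshold are illustrative assumptions, and practical systems use matchers specific to each biometric.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def verify(probe, claimed_id, templates, threshold):
    """Verification: one-to-one comparison with the claimed identity."""
    return distance(probe, templates[claimed_id]) <= threshold


def identify(probe, templates, threshold):
    """Identification: one-to-many search over the whole template database."""
    best_id = min(templates, key=lambda pid: distance(probe, templates[pid]))
    if distance(probe, templates[best_id]) <= threshold:
        return best_id
    return None  # no enrolled template is close enough
```

Verification costs one comparison regardless of the database size, whereas identification scales with the number of enrolled templates.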

Biometrics are usually classified into physiological biometrics and behavioral 
biometrics. Physiological biometrics use biometric characteristics that do not 
change with time. Some examples of these biometrics are fingerprint, face, facial 
thermogram, eye, eye’s iris, eye’s retina scan, ear, palmprint, footprint, palm, 
palm vein, hand vein, hand geometry and DNA. Signature is also known to be 
unique to every individual. Behavioral biometrics are dynamic characteristics 
that change over time. For recognition purposes, one has to record the signal over 
a certain time duration, at a sampling rate chosen according to the Nyquist theorem. Examples of such biometrics 
are speech, keystroke, signature, gesture and gait. Both types of biometrics can 
be fused for some complex systems. 

Biometric cues such as fingerprints, voice, face and signature are specific to 
an individual and characterize that individual. Verification using fingerprints 
is the most widely used, as the fingerprint of an individual is unique [31]. The 
simplest, most pervasive in society, and least obtrusive biometric measure is that 
of human speech. Speech is unique for each individual. Typically, the biometric 
identifiers are scanned and processed in an appropriate algorithm to extract a 
feature vector, which is stored as a template in registration. 

Several companies such as Identix sell high-accuracy face recognition soft- 
ware with databases of more than 1,000 people. Face recognition is fast but 
not extremely reliable, while fingerprint verification is reliable but inefficient in 
database retrieval. Faces and fingerprints can be integrated in a biometric system. 


Physiological biometrics and recognition 


Human faces convey a significant amount of nonverbal information to facilitate 
the real-world human-to-human communication. Modern intelligent systems are 
expected to have the capability to accurately recognize and interpret human faces 
in real time. Facial attributes, such as identity, age, gender, expression, and eth- 
nic origin, play a crucial role in applications of real facial image analysis includ- 
ing multimedia communication, human-computer interaction, and security. Face 
recognition has a wide range of commercial, security, surveillance and law enforcement 
applications, and can even be used in healthcare for helping patients with Alzheimer’s disease. 
Face recognition is less obtrusive than other biometric recognition methods such 
as fingerprint recognition. 

Fingerprint and iris technologies currently offer greater accuracy than face 
recognition, but require explicit cooperation from the user. A fingerprint is the 
pattern of ridges and furrows on the surface of a fingertip. The uniqueness of a 
fingerprint is exclusively determined by the local ridge characteristics and their 
relationships. 

The eye’s iris is the colored region between the pupil and the white region 
(sclera) of the eye. The primary role of the iris is to dilate and constrict the 
size of the pupil. The iris is unique even for twins. The human iris is shown in 
Fig. 24.1a. 

The eye’s retina [10] is essentially a sensory tissue which consists of multiple 
layers. It is located towards the back of the eye. Because of its internal location 
within the eye, the retina is not exposed to the external environment, and thus 
it possesses a very stable biometric. Retinal scanning devices can be purchased 
from EyeDentify, Inc. Uniqueness of retina comes from the uniqueness of the 
pattern distribution of the blood vessels at the top of the retina. The human 
retina is shown in Fig. 24.1b. 

Palmprint verification recognizes a person based on unique features in his 
palm, such as the principal lines, wrinkles, ridges, minutiae points, singular points 
and texture. Palmprint is a promising biometric feature for use in access control 
and forensic applications. The human palmprint is shown in Fig. 24.1c. 

The texture pattern produced by bending the finger knuckle is highly distinctive, 
which makes biometric authentication using finger-knuckle-print imaging possible. The 
human knuckle-print is shown in Fig. 24.1d. 

Hand vein is the subcutaneous vascular pattern/network appearing on the 
back of hand. Vein patterns are quite stable in the age group of 20-50. Some 
commercial products that authenticate individuals from hand vein images are 
also available. The human hand vein is shown in Fig. 24.1e. For finger-vein 
recognition, the vein patterns are extracted from an infrared finger image. 

Figure 24.1 (a) Human iris: The image showing both limbic and pupil boundaries. (b) Retina image. 
(c) Palmprint. (d) Knuckle-print. (e) Human hand vein. 





Hand image based recognition is a viable secure access control scheme. In a 
preprocessing stage of the algorithm [42], the silhouettes of hand images are reg- 
istered to a fixed pose, which involves both rotation and translation of the hand 
and, separately, of the individual fingers. Two feature sets, namely Hausdorff 
distance of the hand contours and independent component features of the hand 
silhouette images, have been comparatively assessed. 

The tongue is a unique organ which can be stuck out of the mouth for inspec- 
tion. Tongue-print recognition [44] is a technology for noninvasive biometric 
assessment. The tongue can present both static features and dynamic features 
for authentication. 

Machine recognition of biometrics from images and videos involves several 
disciplines such as pattern recognition, image processing, computer vision and 
neural networks. Although any feature-extraction technique such as DCT or FFT 
can be used, neural techniques such as PCA, LDA and ICA are local methods 
and thus, are more attractive and widely used. After applying these transforms 
to the images, some of the coefficients are selected to construct feature vectors. 
Depending on the specific classification problem, the MLP, RBF network and 
SVM can be used for supervised classification, while the Hopfield network and 
clustering are used for unsupervised classification. 

A human retinal identification system [10] can be composed of three principal 
modules: blood vessel segmentation, feature generation and feature matching. 
DCT is the most appropriate for feature extraction for fingerprint recognition 
and for face recognition. Fingerprint registration is a critical step in 
fingerprint matching. Feature extraction of signatures can be based on the 
Hough transform, which detects straight lines in the presented images. The 
Hough transform can be applied to edge-detected images to obtain the Hough 
space image of the signature. 
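
The voting step of the Hough transform can be sketched in a few lines. This is a minimal illustration rather than the method of any cited system: the function names, the 1-degree angular resolution and the toy stroke are assumptions, and the edge pixels are taken as already detected.

```python
import math

def hough_lines(edge_points, width, height, n_theta=180):
    """Vote in (rho, theta) space for edge pixels given as (x, y) pairs."""
    max_rho = int(math.ceil(math.hypot(width, height)))
    # Rows index rho in [-max_rho, max_rho]; columns index theta in [0, pi).
    acc = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    for x, y in edge_points:
        for t, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + max_rho][t] += 1
    return acc, max_rho

def strongest_line(acc, max_rho):
    """Return (votes, rho, theta in degrees) of the accumulator peak."""
    n_theta = len(acc[0])
    best = (0, 0, 0.0)
    for r, row in enumerate(acc):
        for t, votes in enumerate(row):
            if votes > best[0]:
                best = (votes, r - max_rho, 180.0 * t / n_theta)
    return best

# Toy "edge image": a horizontal pen stroke at y = 7 in a 100 x 20 image.
points = [(x, 7) for x in range(100)]
acc, max_rho = hough_lines(points, width=100, height=20)
votes, rho, theta_deg = strongest_line(acc, max_rho)
```

The accumulator array `acc` plays the role of the Hough space image; its peaks correspond to the dominant straight strokes.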

For face recognition, the system first detects the faces in the images, segments 
the faces from the cluttered scene, extracts features from a face region, and 
identifies the face according to a stored database of faces. The face region can be 
found by clustering. The region is then preprocessed so as to extract prominent 
features for further recognition. 

Estimating human age automatically via facial image analysis has many 
potential real-world applications, such as human-computer interaction and 
multimedia communication. Age estimation is a type of soft biometric that 
provides ancillary information about the user’s identity. The aging process is 
determined not only by the person’s genes, but also by many external factors, 
such as health, lifestyle, living location, and weather conditions. Males and 
females may also age differently. An age-specific human-computer interaction 
system may be developed for secure network/system access control, electronic 
customer relationship management, security control and surveillance monitoring, 
biometrics, entertainment and cosmetology. The system can ensure that young 
children have no access to Internet pages with adult material. A vending 
machine secured by an age-specific human-computer interaction system can refuse 
to sell alcohol or cigarettes to underage people. An advertising agency can 
find out which kinds of scrolling advertisements attract passengers in which 
age ranges by using a latent computer vision system. Computer-based age 
synthesis and estimation via faces have become particularly prevalent topics, 
with applications such as forensic art. 

Age-group recognition from frontal face images [25] classifies subjects into 
four different age categories. A neural network is used to classify the face 
into age groups using computed facial feature ratios and wrinkle densities. Age 
synthesis is defined as rerendering a face image aesthetically with natural 
aging and rejuvenating effects on the individual face. Age estimation is 
defined as labeling a face image automatically with the exact age or the age 
group of the individual face. Techniques for face-image-based age synthesis and 
estimation are surveyed in [12]. 

Gender and race identifications from face images have many applications [26]: 
improving search engine retrieval accuracy, demographic data collection, and 
human-computer interfaces (adjusting the software behavior with respect to the 
user gender). Moreover, in a biometric recognition framework, gender identifi- 
cation can help by requiring a search of only half the subject database. Gender 
identification may help shopkeepers present targeted advertisements. One way 
to enhance gender identification techniques is to combine different cues such as 
face, gait and voice. 

Individuals can be authenticated by using triangulation of hand vein images 
and simultaneous extraction of knuckle shape information [18]. The method is 
fully automated and employs palm dorsal hand vein images acquired with 
low-cost, near-infrared, contactless imaging. 


Behavioral biometrics and recognition 


From speech, one can identify a speaker, his/her mental status and gender, or 
the content of the speech. Speech features should be constant for each person 
when measured many times and in multiple samples, and should discriminate the 
speaker from other people as much as possible. Emotion recognition from speech 
signals can also 
be conducted. Common features extracted for speech are maximum amplitude 
obtained by cepstral analysis, peak and average power spectrum density, num- 
ber of zero-crossings, formant frequency, pitch and pitch amplitude, and time 
duration. 

Automatic language identification aims to quickly and accurately identify the 
language being spoken (e.g. English, Spanish, etc.). Language identification has 
numerous applications in a wide range of multi-lingual services. The language 
identification system can be used to route an incoming telephone call to a human 
operator fluent in the corresponding language. 

Speaker recognition systems utilize human speech to recognize, identify or 
verify an individual. It is important to extract features from each frame which 
can capture the speaker-specific characteristics. The same features adopted in 
speech recognition are equally successful when applied to speaker recognition. 
Mel-frequency cepstral coefficients (MFCCs) have been most commonly used 
in both speech recognition and speaker recognition systems. Linear prediction 
coefficients have also received special attention in this respect. For each 
utterance, contiguous 25 ms frames of speech can be extracted and MFCC features 
obtained. 
Speaker segmentation aims at finding speaker change points in an audio stream, 
whereas speaker clustering aims at grouping speech segments based on speaker 
characteristics. 

The task of text-independent speaker verification entails verifying a particular 
speaker from possible imposters without any knowledge of the text spoken. A 
conventional approach uses a classifier to generate scores based on individual 
speech frames. The scores are integrated over the duration of the utterance and 
compared against a threshold to accept or reject the speaker. 
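
The score integration step can be sketched as follows. The frame scores here are hypothetical per-frame log-likelihood ratios (claimed speaker against a background model), averaging is one assumed way of integrating over the utterance, and the threshold of zero is an illustrative choice.

```python
def verify_speaker(frame_scores, threshold):
    """Accept the claimed identity if the utterance-level score reaches the threshold."""
    if not frame_scores:
        raise ValueError("at least one frame score is required")
    # Integrate the frame-level scores over the duration of the utterance.
    utterance_score = sum(frame_scores) / len(frame_scores)
    return utterance_score >= threshold, utterance_score

# Hypothetical per-frame scores produced by some frame-level classifier.
genuine_frames = [0.8, 1.1, -0.2, 0.9, 0.6]
impostor_frames = [-0.7, 0.1, -1.0, -0.4, 0.2]

accept_genuine, score_genuine = verify_speaker(genuine_frames, threshold=0.0)
accept_impostor, score_impostor = verify_speaker(impostor_frames, threshold=0.0)
```

Individual frames may vote the wrong way (the third genuine frame is negative); the decision relies only on the integrated score.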

Speech emotion recognition can be implemented using speaking rate [17]. The 
emotions are anger, disgust, fear, happiness, neutral, sadness, sarcasm and sur- 
prise. At the first stage, based on speaking rate, the eight emotions are catego- 
rized into three broad groups namely active (fast), normal and passive (slow). In 
the second stage, these three broad groups are further classified into individual 
emotions using vocal tract characteristics. 

Visual speech recognition [5] uses the visual information of the speaker’s face 
in performing speech recognition. A speaker produces speech using the 
articulatory organs together with the muscles that generate facial expressions. 
Because some of the articulators, such as the tongue, the teeth, and the lips, 
are visible, there is an inherent relationship between the acoustic and visible 
speech. Visual speech recognition is also referred to as lipreading. 

HMM-based speech synthesis first extracts parametric representations of 
speech including spectral (filter) and excitation (source) parameters from a 
speech database and then models them by using a set of subword HMMs. One 
of the key tasks of spoken-dialog systems is classification. 

Gait is an efficient biometric feature for human identification at a distance. 
Gait should be analyzed within complete walking cycle(s) because it is a kind of 
periodic action. In a window of complete walking cycle(s), the gait (motion) 
energy image is constructed in the spatial domain as the gait feature. Gait 
recognition methods can be mainly classified into three categories: model-based, 
appearance-based and spatiotemporal-based. For a walking person, there are two 
kinds of information in his/her gait signature: static and dynamic. This is also 
applicable to human action analysis. After the motion features are computed by 
obtaining a sparse decomposition of the scale invariant feature transform (SIFT) 
features [41], an action is often represented by a collection of codewords in a 
predefined codebook; this is the bag-of-words model. 
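
A bag-of-words representation can be sketched as a histogram over codewords. In this illustration each feature is simply assigned to its nearest codeword (plain vector quantization, used here in place of the sparse decomposition mentioned above), and the two-dimensional features and three-entry codebook are toy values.

```python
def nearest_codeword(feature, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(feature, codebook[k]))

def bag_of_words(features, codebook):
    """Normalized histogram of codeword occurrences for one action clip."""
    hist = [0] * len(codebook)
    for f in features:
        hist[nearest_codeword(f, codebook)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy codebook and local features (real descriptors such as SIFT are 128-dimensional).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
histogram = bag_of_words(features, codebook)
```

The resulting fixed-length histogram can be passed to any of the classifiers discussed earlier, regardless of how many local features the clip produced.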

Hand gesture recognition has been intensively applied in various human- 
computer interaction systems. Different hand gesture recognition methods are 
based on particular features, e.g., gesture trajectories and acceleration signals. 
Hand gestures can be used to exchange information with other people in a virtual 
space, to guide robots to perform certain tasks in a hostile environment, or to 
interact with computers. Hand gestures can be divided into two main categories: 
static and dynamic. 

A dynamic hand gesture recognition technique based on the 2D skeleton rep- 
resentation of the hand is given in [15]. For each gesture, the hand skeletons of 
each posture are superposed providing a single image which is the dynamic sig- 
nature of the gesture. The recognition is performed by comparing this signature 
with the ones from a gesture alphabet, using Baddeley’s distance as a measure 
of dissimilarities between model parameters. 

Handwriting-based biometric recognition [29] includes handwriting recognition, 
forensic verification, and user authentication. Techniques for handwritten 
character recognition are reviewed and compared in [19]. Typically, 
handwriting-based biometric verification and identification use signatures. 
Signature as proof of authenticity is a socially well-accepted transaction, 
especially for legal document management and financial transactions. Online 
signature verification provides reliable authentication. Biometric signature 
verification techniques are based on the dynamics of a person’s signature, 
namely, time series of pen-tip coordinates, writing forces, or inclination 
angles of the pen. However, the signatures of a person may be very variable. 


Face detection and recognition 


Research efforts in face processing include face detection, face recognition, 
face tracking, pose estimation, and expression recognition. Face recognition is 
sufficiently mature to be ported to real-time systems. However, it is not robust 
in natural environments, where noise, illumination and pose problems arise, or 
in real-time and video environments. Some kind of information fusion is 
necessary for reliable recognition. 


Face detection 


Face detection is the first essential step for automatic face recognition. Face 
detection is a difficult task in view of pose, facial expression, presence of some 
facial components and their variability such as beards, mustaches and glasses, 
texture, size, shape, color, occlusion, and imaging orientation and lighting. A 
survey of face detection techniques is given in [39]. Human skin color is an 
effective feature in many applications from face detection to hand tracking. 
Although different people have different skin colors, the difference lies 
largely in intensity rather than in chrominance. 
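
This observation suggests classifying skin pixels from chrominance alone. The sketch below converts RGB to YCbCr with the full-range BT.601 (JPEG-style) formulas and ignores the luminance Y; the Cb and Cr ranges are commonly quoted illustrative values and are an assumption here, not taken from the text.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG-style) conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Chrominance-only skin test; the default ranges are assumed values."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]

# The same hue at two brightness levels, and a blue background pixel.
bright_skin = is_skin(200, 140, 110)
dark_skin = is_skin(100, 70, 55)
blue_pixel = is_skin(30, 60, 200)
```

The two skin-like pixels differ by a factor of two in intensity yet fall in the same chrominance region, while the blue pixel is rejected.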

Training a neural network for the face detection task is challenging because 
of the difficulty in characterizing prototypical nonface images. Face detection 
is a typical two-class problem: distinguishing images that contain faces from 
images that do not. The boundary between the face and nonface patterns 
is highly nonlinear because the face manifold due to variations in facial appear- 
ance, lighting, head pose and expression is highly complex. The learning-based 
approach has so far been the most effective one for constructing face/nonface 
classifiers. 

PCA can be used for the localization of a face region. Due to the fact that 
color is the most discriminating feature of a facial region, the first step can be a 
pixel-based color segmentation to detect skin-colored regions. The performance 
of such a hierarchical system is highly dependent on the results of this initial 
segmentation. The subsequent classification based on shape may fail if only parts 
of the face are detected or the face region is merged with skin-colored background. 

Human face regions can be rapidly detected in MPEG video sequences [36]. 
The underlying algorithm takes the inverse quantized DCT coefficients of MPEG 
video as the input, and outputs the locations of the detected face regions. The 
algorithm consists of three stages, where chrominance, shape and frequency infor- 
mation are, respectively, used. By detecting faces directly in the compressed 
domain, there is no need to carry out inverse DCT. The algorithm detects 85-92% 
of the faces in three test sets, including both intraframe and interframe 
coded image frames from news video, without constraints on the number, size, 
and orientation of the faces. The algorithm can be applied to unconstrained 
JPEG images or motion JPEG video as well. A robust face tracking system 
presented in [45] extracts multiple face sequences from MPEG video. 
Specifically, a view-based DCT-domain face detection algorithm is first applied 
periodically to capture mostly frontal and slightly slanted faces of variable 
sizes and locations. The face tracker then searches the target faces in local 
areas across frames in both the forward and backward directions. The tracking 
combines color histogram matching and skin-color adaptation to provide robust 
tracking. 

For visual surveillance, the input frames can be preprocessed to extract the 
information about moving objects by applying a change detection technique. 
Groups of connected pixels (blobs) belonging to the class of moving points rep- 
resent possible people in the scene. These blobs are then used to locate the faces. 
In [11], a real-time face detection system for color image sequences is presented. 
The system applies three different face detection methods and integrates the 
results so obtained to achieve a greater location accuracy. The extracted blob is 
then subject to outline analysis, skin color detection, and PCA. Outline analysis 
is applied to localize the human head. A skin color method is applied to the 
blobs to find skin regions. PCA is trained for frontal view faces only, and is used 
to classify if a particular skin region is a face or a non-face. Finally, the obtained 
face locations are fused to increase the detection reliability and to avoid false 
detections due to occlusions or unfavorable human poses. 

The accuracy of face alignment affects the performance of a face recognition 
system. Since face alignment is usually conducted using eye positions, an accurate 
eye localization algorithm is therefore essential for accurate face recognition. 
An automatic technique for eye detection is introduced in [37]. The automatic 
eye detection technique has an overall 94.5% eye detection rate using FRGC 
1.0 database, with the detected eyes very close to the manually provided eye 
positions. In [16], color, edge, and binary information are used to detect eye 
pair candidate regions in the input image, and the face candidate region 
containing the detected eye pair is then extracted. The approach achieves a 
face detection rate above 99.2%. 


Face recognition 


Face recognition approaches could be categorized into feature-based approaches 
and holistic approaches. Holistic matching methods use the whole face region as 
the raw input to a recognition system. In feature-based (structural) matching 
methods, local features such as the eyes, nose and mouth are first extracted and 
their locations and local statistics (geometric and/or appearance) are fed into a 
structural classifier. 

PCA is applied on the training set of faces, yielding the eigenfaces approach 
[35]. It assumes that the set of all possible face images occupies a 
low-dimensional subspace of the original high-dimensional input image space. 
The eigenfaces algorithm represents a face with 50 to 100 coefficients and uses 
a global encoding of the face, which makes the algorithm fast. 

A multimodal-part face recognition method based on PCA [34] combines multiple 
parts of the face: five kinds of parts are processed by PCA to obtain 
eigenfaces, eigeneyebrows, eigeneyes, eigennoses and eigenmouths. Thirty-one 
different face recognition models are created by using combinations of the 
different parts. This method has a higher recognition rate and more flexibility 
than the eigenfaces method. 
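
The count of thirty-one models follows from taking every non-empty subset of the five parts, i.e. 2^5 − 1 = 31. A short enumeration (the part labels are just illustrative strings) confirms this:

```python
from itertools import combinations

parts = ["face", "eyebrows", "eyes", "nose", "mouth"]
# Every non-empty subset of the five parts defines one candidate recognition model.
models = [c for k in range(1, len(parts) + 1) for c in combinations(parts, k)]
n_models = len(models)
```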

In the Fisherfaces approach, PCA is first used for dimension reduction before 
the application of LDA. The eigenfaces method derives the most expressive 
features, while Fisherfaces derives the most discriminative features. Eigenfaces 
can typically achieve a classification rate of 90% for the ORL database, and 
Fisherfaces can achieve 95%. Neural network approaches using the MLP, RBF 
network and LVQ can outperform Fisherfaces. All these methods belong to the 
holistic approach. Eigenfaces usually needs many (usually > 5) images for each 
person in the gallery. PCA and its derived methods are sensitive to variations 
caused by illumination and rotation. 


Example 24.1: Face recognition using Fisherfaces. In Example 12.3, we 
have given an example of face recognition using eigenfaces. For the recognition 
of C = 30 people by using N = 60 training samples, the method achieves a 
classification rate of 100%. In this example, we rework the same problem by 
using the Fisherfaces approach. Fisherfaces first applies PCA to project the 
samples onto an (N − C)-dimensional linear subspace, and then applies LDA to 
project the resulting projections onto a (C − 1)-dimensional linear subspace so 
as to determine the most discriminating features between faces. The combined 
weights are called Fisherfaces. When a new sample is presented, its projection 
onto the weights is derived. The projection of the presented sample is compared 
with those of all the training samples, and the sample is assigned to the class 
of the training sample that has the minimum difference from it. An experiment 
on the set of 30 testing samples shows a classification rate of 86.7% (26/30). 
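
The two-stage projection of Example 24.1 can be sketched with NumPy as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the example: the data are synthetic, well-separated Gaussian clusters standing in for face images, the function names are hypothetical, and the LDA step is solved as a generalized eigenproblem by Cholesky whitening. Its accuracy is therefore not comparable with the 86.7% reported above.

```python
import numpy as np

def fisherfaces_fit(X, y):
    """Return (mean, W); W maps centered images to a (C - 1)-dimensional space."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    mean = X.mean(axis=0)
    Xc = X - mean
    # Step 1: PCA via SVD, keeping N - C principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[: n_samples - n_classes].T
    Z = Xc @ W_pca
    # Step 2: within-class (Sw) and between-class (Sb) scatter in the PCA subspace.
    m = Z.shape[1]
    Sw, Sb = np.zeros((m, m)), np.zeros((m, m))
    for c in classes:
        Zc = Z[y == c]
        mu = Zc.mean(axis=0)
        Sw += (Zc - mu).T @ (Zc - mu)
        Sb += Zc.shape[0] * np.outer(mu, mu)  # the global mean of Z is zero
    # Step 3: LDA as the generalized eigenproblem Sb w = lambda Sw w,
    # solved by whitening with the Cholesky factor of Sw.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    _, U = np.linalg.eigh(L_inv @ Sb @ L_inv.T)   # eigenvalues in ascending order
    W_lda = L_inv.T @ U[:, ::-1][:, : n_classes - 1]
    return mean, W_pca @ W_lda

def nearest_neighbor_predict(X_new, mean, W, X_train, y_train):
    """Assign each new sample the label of the closest projected training sample."""
    P_train = (X_train - mean) @ W
    P_new = (X_new - mean) @ W
    d2 = ((P_new[:, None, :] - P_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]

# Synthetic stand-in for face images: C = 3 classes, 5 training and 3 test samples each.
rng = np.random.default_rng(0)
n_classes, dim = 3, 30
centers = rng.normal(scale=50.0, size=(n_classes, dim))
y_train = np.repeat(np.arange(n_classes), 5)
y_test = np.repeat(np.arange(n_classes), 3)
X_train = centers[y_train] + rng.normal(size=(y_train.size, dim))
X_test = centers[y_test] + rng.normal(size=(y_test.size, dim))

mean, W = fisherfaces_fit(X_train, y_train)
predicted = nearest_neighbor_predict(X_test, mean, W, X_train, y_train)
accuracy = np.mean(predicted == y_test)
```

Here N − C = 12 and C − 1 = 2, so each 30-dimensional sample is reduced to two discriminant coordinates before the nearest-neighbor comparison.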


ICA representations are superior to PCA representations for recognizing faces 
across days and changes in expression [2]. Associative memory is another kind 
of neural network for face recognition. In brief, associative memory based clas- 
sification learns how to do recognition by categorizing positive examples of a 
subject. 

Wavelets can be used to extract global as well as local features, such as the 
nose and eye regions of a face. Face representation based on Gabor features 
has attracted much attention and achieved great success in face recognition. 
Gabor wavelets model the receptive field profiles of cortical simple cells quite 
well. Gabor wavelets exhibit desirable characteristics of spatial frequency, 
spatial locality, and orientation selectivity to cope with the variations due to 
illumination and facial expression changes. Moreover, to extract more local 
features from the original images, a series of Gabor filters with various scales 
and orientations (called a Gabor filter bank) is needed in most biometric 
applications. 

An illumination normalization approach for face recognition under varying 
lighting conditions is presented in [6]. DCT is employed to compensate for 
illumination variations in the logarithmic domain. Since illumination 
variations mainly lie in the low-frequency band, an appropriate number of 
low-frequency DCT coefficients are truncated to minimize variations under 
different lighting conditions. 
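
The idea can be sketched with NumPy as follows. This is a simplified illustration, not the exact procedure of [6]: the triangular low-frequency region, the parameter name `n_discard` and the zeroing of the DC term are assumptions, and the test image is synthetic, with a shading pattern chosen to lie exactly in the discarded band.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ x is the 1D DCT of x."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

def normalize_illumination(image, n_discard=3):
    """Return the log-domain image with its lowest-frequency DCT terms set to zero."""
    log_img = np.log(np.clip(image.astype(float), 1e-6, None))
    Cr, Cc = dct_matrix(image.shape[0]), dct_matrix(image.shape[1])
    D = Cr @ log_img @ Cc.T                  # 2D DCT of the log image
    u = np.arange(image.shape[0])[:, None]
    v = np.arange(image.shape[1])[None, :]
    D[(u + v) < n_discard] = 0.0             # truncate the low-frequency band
    return Cr.T @ D @ Cc                     # inverse 2D DCT (still in the log domain)

# Synthetic test: the same reflectance under even lighting and under smooth side lighting.
rng = np.random.default_rng(1)
n = 16
reflectance = rng.uniform(0.2, 1.0, size=(n, n))
cols = np.arange(n)
shading = np.exp(0.3 + 0.8 * np.cos(np.pi * (2 * cols + 1) / (2 * n)))
evenly_lit = reflectance
side_lit = reflectance * shading[None, :]
same_after = np.allclose(normalize_illumination(evenly_lit),
                         normalize_illumination(side_lit))
```

Because image intensity is the product of reflectance and illumination, the logarithm turns the illumination into an additive low-frequency term, which is what the truncation removes.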

In [43], Gabor features, derived from the convolution of a Gabor filter and 
palmprint images, are represented and matched by Hamming code and Hamming 
distance, respectively. The Gabor-Fisher classifier for face recognition [21] 
is robust to changes in illumination and facial expression. It applies enhanced 
LDA to an augmented Gabor feature vector derived from the Gabor wavelet 
representation of face images. Using only 62 features, the Gabor-Fisher 
classifier achieves 100% accuracy in a face recognition test on 600 FERET 
frontal face images corresponding to 200 subjects. Gabor-based kernel PCA [22] 
integrates the Gabor wavelet representation of face images and kernel PCA for 
face recognition. 

Much like recognition from visible imagery is affected by illumination, 
recognition with thermal face imagery is affected by a number of exogenous and 
endogenous factors such as weather conditions, environment change and the 
subject’s metabolism. While the appearance of some features may change, their 
underlying shape remains the same and continues to hold useful information for 
recognition [33]. 

The Laplacianfaces approach [14] uses locality-preserving projections to map 
face images into a face subspace for analysis. Locality-preserving projections 
find an embedding that preserves local information, and obtain a face subspace 
that best detects the essential face manifold structure. The Laplacianfaces are 
the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami 
operator on the face manifold. The unwanted variations resulting from changes 
in lighting, facial expression and pose may be eliminated or reduced. The method 
provides a better representation and achieves lower error rates in face recognition 
than eigenfaces and Fisherfaces do. 

A video sequence provides more information for face recognition than a single 
image does. In order to take advantage of the large amount of information in 
the video sequence, a video face recognition algorithm based on the fusion of 
multiple classifiers can be used. 

With the development of 3D imaging techniques, 3D face recognition is becom- 
ing a natural choice to overcome the shortcomings of 2D face recognition, since 
a 3D face image records the exact geometry of the subject, invariant to illu- 
mination and orientation changes. 3D model based face recognition is robust 
against pose and lighting variations. In 3D face recognition, registration is a key 
preprocessing step. 3D face recognition can be implemented using reconstructed 
3D models from a set of 2D images [13]. The reconstructed 3D model is used 
to obtain the 2D projection images that are matched with probe images [20]. In 
[23], 3D models are used to recognize 2.5D face scans, provided by commercial 
3D sensors, such as Minolta Vivid series. A 2.5D scan is a simplified 3D (x,y,z) 
surface representation that contains at most one depth value (z direction) for 
every point in the (x,y) plane, associated with a registered texture image. 


Bioinformatics 


Statistical and computational problems in biology and medicine have created 
bioinformatics. Typical bioinformatics problems are protein structure prediction 
from amino acid sequences, fold pattern recognition, homology modelling, 
multiple alignment, distant homology, motif finding, protein folding and 
phylogeny, which result in a large number of NP-hard optimization problems. 
Prognostic prediction for the recurrence of a disease or the death of a patient 
is also a classification or a regression problem. 

The genetic information that defines the characteristics of living cells within 
an organism is encoded in the form of a moderately simple molecule, 
deoxyribonucleic acid (DNA). DNA is often represented as a string composed of 
four nucleotide bases with the chemical formulae: adenine (A, 
$\mathrm{C_5H_5N_5}$), cytosine (C, $\mathrm{C_4H_5N_3O}$), guanine (G, 
$\mathrm{C_5H_5N_5O}$) and thymine (T, $\mathrm{C_5H_6N_2O_2}$). DNA has the 
ability to perform two basic functions: replicating itself and storing 
information on the linear composition of the amino acids in proteins. 

Ribonucleic acid (RNA) is a polymer of repeating units, namely ribonucleotides, 
with a structure analogous to single-stranded DNA. The sugar deoxyribose, which 
appears in DNA, is replaced in RNA by another sugar called ribose, and the 
base thymine (T) that appears in DNA is replaced in RNA by another organic 
base called uracil (U). Compared with DNA, RNA molecules are less stable and 
exhibit more variability in their three-dimensional structure. Although RNA and 
protein polymers seem to exhibit a much more variable functional spatial struc- 
ture than DNA does, their linear content is always a copy of a short fragment of 
the genome. 

Proteins are basic constructional blocks and functional elements of living 
organisms. In an organism, proteins are responsible for carrying out many dif- 
ferent functions in the life-cycle of the organism. Proteins are molecules with 
complicated three-dimensional structures, but they always have an underlying 
linear chain of amino acids as their primary structure. Each protein is a chain 
built from 20 different amino acids in a specific order, and it has unique 
functions. The length of a protein is between 50 and 3000 amino acids, with an 
average length of 200. The order 
of amino acids is determined by the DNA sequences in the gene which codes for 
a specific protein. 

The production of a viable protein from a gene is called gene expression, and 
the regulation of gene expression is a fundamental process necessary to maintain 
the viability of an organism. Gene expression is a two-step process. A specific 
segment of the genome codes for a gene; to produce a specific protein, the gene 
is first transcribed from DNA into a messenger RNA (mRNA), which is then 
converted into a protein via translation. The process of gene expression is 
regulated by certain proteins known as transcription factors. In both 
replication (creating a new copy of DNA) and transcription (creating an RNA 
sequence complementary to a DNA fragment), protein enzymes called polymerases 
slide along the template DNA strand. It is very difficult to measure the protein 
level directly because there are simply too many proteins in a cell. Therefore, 
the levels of mRNA are used to identify how much of a specific protein is 
present in a sample, i.e., they give an indication of the levels of gene 
expression. 
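
The two steps of gene expression can be mimicked on strings. The sketch below uses the standard genetic code in its usual compact TCAG ordering; the function names and the nine-base template strand are illustrative assumptions, and biological details such as splicing and regulation are ignored.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, indexed by the three codon positions in TCAG order.
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def transcribe(template_strand):
    """mRNA complementary to a DNA template strand written 5'->3'; U replaces T."""
    coding = "".join(COMPLEMENT[b] for b in reversed(template_strand.upper()))
    return coding.replace("T", "U")

def translate(mrna):
    """Translate from the first AUG codon until a stop codon ('*') is reached."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for p in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[p:p + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("TTAGGCCAT")   # hypothetical template strand
protein = translate(mrna)        # one-letter amino acid codes
```

For this toy template the transcript is `AUGGCCUAA`, which translates to methionine followed by alanine before the stop codon.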

The genome is the ensemble of genes in an organism, and genomics is the 
study of the genome. The major goal of genomics is to determine the function 
of each gene in the genome (i.e., to annotate the sequence). The genome is the 
complete DNA sequence of an organism. Genomics is concerned with the analysis 
of gene sequences, including comparison of gene sequences, and analysis of the 
succession of symbols in sequences. 

The international Human Genome Project had the primary goals of determining 
the sequence of the 3 billion nucleotide bases that make up human DNA (i.e., 
DNA sequencing) and of identifying the genes of the human genome from both a 
physical 
and functional standpoint. The project began in 1990 and was initially headed 
by James D. Watson at the U.S. National Institutes of Health. The complete 
genome was released in 2003. Approximate numbers of different objects in the 
human body are 30,000 genes within the human genome, 10° mRNA, 3 x 10° 
proteins, $10^3$-$10^4$ expressed proteins, 250 cell types, and $10^{13}$-$10^{14}$ cells [30]. The 
sequencing of entire genomes of various organisms has become one of the basic 
tools of biology. 

One of the greatest challenges facing molecular biology is the understanding of 
the complex mechanisms regulating gene expression. Identification of the motif 
sequences is the first step toward unraveling the complex genetic networks 
formed by the interacting regulation of multiple genes. The goal of gene 
identification is automatic annotation. For genome sequencing and annotation, 
and for sequence retrieval and comparison, a natural toolbox is provided by 
HMMs and the Viterbi algorithm based on dynamic programming. 
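
As an illustration of how the Viterbi algorithm labels a sequence, the sketch below decodes a toy two-state HMM whose hidden states are an AT-rich and a GC-rich region. All probabilities are made-up values chosen for the example; a real gene finder would use many more states and parameters estimated from annotated data.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path of an HMM for an observation sequence (log domain)."""
    # delta[s]: best log-probability of any path ending in state s at the current position.
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + log_trans[r][s])
            delta[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][symbol]
            ptr[s] = best_prev
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Toy model (assumed parameters): regions tend to persist and favor their own bases.
states = ["AT", "GC"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {r: {s: math.log(0.9 if r == s else 0.1) for s in states} for r in states}
log_emit = {
    "AT": {b: math.log(0.4 if b in "at" else 0.1) for b in "acgt"},
    "GC": {b: math.log(0.4 if b in "gc" else 0.1) for b in "acgt"},
}
path, log_prob = viterbi("aattatgcgcgc", states, log_start, log_trans, log_emit)
```

The decoded path switches from the AT-rich to the GC-rich state after the sixth base. Working with log-probabilities avoids numerical underflow on long sequences.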

The proteome is the set of proteins of an organism. It is the vocabulary of the 
genome. Via the proteome, genetic regulatory networks can be elucidated. Pro- 
teomics combines the census, distribution, interactions, dynamics, and expression 
patterns of the proteins in living systems. The primary task is to correlate the 
pattern of gene expression with the state of the organism. For any given cell, typ- 
ically only 10% of the genes are actually translated into proteins under a given 
set of conditions and at a particular epoch in the cell’s life. On the other hand, 
a given gene sequence can give rise to tens of different proteins, by varying the 
arrangements of the exons and by post-translational modification. As proteins 
are the primary vehicle of phenotype, proteomics constitutes a bridge between 
genotype and phenotype. Comparison between the proteomes of diseased and 
healthy organisms forms the foundation of the molecular diagnosis of disease. 

Figure 24.2 The relation among genes, mRNA, proteins and metabolites. The curved arrows denote 
regulatory processes [30]. 

Figure 24.2 highlights the principal objects of investigation of 
bioinformatics. The three main biological processes are DNA sequence 
determining protein 
sequence, protein sequence determining protein structure and protein structure 
determining protein function. 


Example 24.2: The consensus sequence for the human mitochondrial genome 
has the GenBank accession number NC_001807. The nucleotide sequence for the 
human mitochondrial genome has 16571 elements in the form of a character array: 
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcat 
ttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctg 
gagccggagcaccctatgtcgcagtatctgtctttgattcctgcctcatt 

ctat... 

In the analysis of a DNA sequence, sections with a high percentage of A+T 
nucleotides usually indicate intergenic parts of the sequence, while low A+T 
and higher G+C nucleotide percentages indicate possible genes. A high CG 
dinucleotide content is often located before a gene. Figure 24.3 gives the 
sequence statistics of the human mitochondrial genome. 
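
Statistics of this kind are straightforward to compute. The sketch below works on the first 50 bases quoted in the example; the function names and the window and step sizes are illustrative assumptions, and the full genome would have to be obtained separately.

```python
from collections import Counter

def base_fractions(seq):
    """Fraction of each nucleotide among the a, c, g, t symbols of the sequence."""
    counts = Counter(seq.lower())
    total = sum(counts[b] for b in "acgt")
    return {b: counts[b] / total for b in "acgt"}

def windowed_gc(seq, window, step):
    """(start, G+C fraction) for sliding windows; window and step are free parameters."""
    seq = seq.lower()
    return [
        (start, sum(seq[start:start + window].count(b) for b in "gc") / window)
        for start in range(0, len(seq) - window + 1, step)
    ]

def cpg_count(seq):
    """Number of CG dinucleotides in the sequence."""
    return seq.lower().count("cg")

# First 50 bases of the sequence quoted in Example 24.2.
fragment = "gatcacaggtctatcaccctattaaccactcacgggagctctccatgcat"
fractions = base_fractions(fragment)
gc_profile = windowed_gc(fragment, window=10, step=10)
n_cpg = cpg_count(fragment)
```

On a full genome, windows with a persistently high G+C fraction would be the candidate gene regions described above.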


Microarray technology 


The idea of measuring the level of mRNA as a surrogate measure of the level of 
gene expression dates back to the 1970s. The methods of that time allowed only 
a few genes to be studied at a time. Microarrays allow the mRNA levels of 
thousands of genes to be measured in a single experiment, showing whether those 
genes are active, hyperactive or silent. A microarray is typically a small 
glass slide or silicon wafer, upon which genes or gene fragments are deposited 
or synthesized in a high-density manner. 

Figure 24.3 The nucleotide density of the human mitochondrial genome. 


DNA microarrays evaluate the behavior and inter-relationship of genes on a 
genomic scale. This is done by quantifying the amount of mRNA for each gene 
that is contained within the cell. DNA microarray technology permits a 
systematic study of the correlation of the expression of thousands of genes. It 
is very useful for drug development and testing, gene function annotation, and 
cancer diagnosis. 

To make a microarray, the first stage is probe selection. It determines the 
genetic materials to be deposited or synthesized on the array. The genetic mate- 
rials deposited on the array serve as probes to detect the level of expressions for 
various genes in the sample. Each gene is represented by a single probe. A probe 
is normally single stranded (denatured) DNA, so the genetic material from the 
sample can bind with the probe. Once the probes are selected, each type of probe 
will be deposited or synthesized on a predetermined spot on the array. Each spot 
will have thousands of probes of the same type, so the level of intensity detected 
at each spot can be traced back to the corresponding probe. A sample is prepared 
after the microarray is made. The mRNA in the sample is first extracted and 
purified, then is reverse transcribed into single stranded DNA and a fluorescent 
marker is attached to each transcript. The single stranded DNA transcript will 
only bind with the probe that is complementary with the transcript, that is, 
binding will only occur if the DNA transcript from the sample is coming from 
the same gene as the probe. By measuring the amount of fluorescence in each 
spot using a scanner, the level of expression of each gene can be measured. The 
result of a DNA microarray experiment is shown in Fig. 24.4. 

A microarray is a small chip onto which a large number of DNA molecules 
(probes) are attached in fixed grids. The chip is made of chemically coated glass, 
Figure 24.4 The result of a DNA microarray experiment. The fluorescence intensity level at each spot
indicates the corresponding gene’s expression level.


nylon, membrane or silicon. Each grid cell of a microarray chip corresponds to
a DNA sequence. For a cDNA (complementary DNA) microarray experiment, the
first step is to extract RNA from a tissue sample and amplify it.
Thereafter two mRNA samples are reverse-transcribed into cDNA and labelled using
the red-fluorescent dye Cy5 and the green-fluorescent dye Cy3. The cDNA binds to
the specific oligonucleotides on the array. In the subsequent stage, the dye is
excited by a laser so that the amount of cDNA can be quantified by measuring the
fluorescence intensities. The log ratio of the two dye intensities is used as
the gene expression profile:

$$\text{gene expression level} = \log_2 \frac{\text{intensity (Cy5)}}{\text{intensity (Cy3)}}.$$

A microarray gene expression data matrix is a real-valued $n \times d$ matrix,
where the rows correspond to $n$ genes and the columns correspond to $d$
conditions (or time points).
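
The log-ratio definition above can be illustrated with a small sketch. The function names and the toy intensity values are hypothetical and chosen only for illustration; real pipelines usually add background correction and normalization steps that are not covered here.

```python
import math

def log_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

def expression_matrix(cy5_rows, cy3_rows):
    """Build the n x d matrix of log ratios (rows: genes, columns: conditions)."""
    return [[log_ratio(r, g) for r, g in zip(row5, row3)]
            for row5, row3 in zip(cy5_rows, cy3_rows)]

# Toy data (illustrative only): 3 genes measured under 2 conditions.
cy5 = [[800.0, 200.0], [100.0, 100.0], [50.0, 400.0]]
cy3 = [[200.0, 200.0], [400.0, 100.0], [50.0, 100.0]]
X = expression_matrix(cy5, cy3)
print(X)  # [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]]
```

A positive entry indicates a stronger Cy5 signal than Cy3 for that gene and condition, a negative entry the opposite, and zero equal intensities.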

Protein microarrays allow a simultaneous assessment of expression levels for 
thousands of genes across various treatment conditions and time. The main dif- 
ference compared with nucleic acid arrays is the difficulty and expense of placing 
thousands of protein capture agents on the array. 

Exhaustively labeling all the experimental samples can be simply impossible.
To extract knowledge from such a huge volume of microarray gene expression
data, computational analysis is required. Clustering is one of the primary
approaches to analyzing such a large amount of data to discover groups of
coexpressed genes. In [27], an attempt is made to improve the performance
of fuzzy clustering by combining it with an SVM classifier. Genetic linkage
analysis is a statistical method for mapping genes onto chromosomes and
determining the distance between them. A tool called RC_Link [1] can model genetic
linkage analysis problems as Bayesian networks and perform inference efficiently.

The regulation of gene expression is a dynamic process. There are inherent 
difficulties in discovering regulatory networks using microarray data [9]. Genetic 
networks provide a concise representation of the interaction between multiple 
genes at the system level. The inference of genetic regulatory networks based on 
the time-series gene expression data from microarray experiments has become 
an important and effective way to achieve this goal. Genetic regulatory network 
inference is critically important for revealing fundamental cellular processes, 
investigating gene functions, and understanding their relations. Pharmaceutical 
companies can test how cells will react to new drug treatments by observing gene 
expression patterns pre- and post-treatment. Genetic regulatory networks can be 
inferred by using Boolean networks, Bayesian networks, continuous models, and 
fuzzy logic models. Genetic regulatory networks can be inferred from time-series 
gene expression data by using a recurrent network and particle swarm optimiza- 
tion, where gene interaction is explained through a connection weight matrix 
[38]. 


Motif discovery, sequence alignment, protein folding, and coclustering 


A motif is often defined more formally as a sequence of amino
acids that defines a substructure in a protein, which can be connected in some
way to protein function or structural stability. In DNA sequences, motifs are
defined as transcription factor binding sites. Identification of motifs is a
difficult task since they are very short (generally 6-20 nucleotides long) and
may have variations as a result of mutations, insertions, and deletions [8].
Moreover, these genomic patterns reside on very long DNA sequences, which makes
the task intractable for traditional computational methods. In general, many
algorithms either adopt probabilistic methods such as Gibbs sampling, or are
based on word-enumerative exhaustive search methods. Various motif discovery
tools are evaluated and compared with respect to their performance in [8]. DNA
motif discovery helps to better understand the regulation of transcription in
the protein synthesis process.

In molecular biology, DNA or protein sequences store genetic information. 
A typical bioinformatic problem is provided by polymorphisms in the human 
genome. Any two human DNA sequences differ at random points located, on aver- 
age, several hundred nucleotides apart. The human genome sequence is based on 
only a few individuals. Sequence alignment is used to arrange sequences of DNA, 
RNA, or proteins in order to identify regions of similarity as these regions might 
be a consequence of functional, structural, or evolutionary relationships between 
the sequences. BLAST and ClustalW are well-known tools for sequence align- 
ment by calculating the statistical significance of matches to sequence databases. 
Figure 24.5 Sequence alignment of the human and mouse amino acid sequences. The color for
highlighting the matches is red (|) and the color for highlighting similar residues that are not exact 
matches is magenta (:). 


GenBank and EMBL-Bank are sequence databases, which provide annotated 
DNA sequences. 

Computational approaches to sequence alignment are generally dynamic
programming-based optimal methods or heuristic methods. The Smith-Waterman
algorithm [32] is a dynamic programming technique for local pairwise
alignment of two DNA sequences. The most widely used methods for
multiple sequence alignment include scalar-product based alignment of groups
of sequences. Scalar-product based alignment algorithms can be significantly
sped up by general-purpose GPUs [3]. The local alignment problem is stated
as follows. Given a set of unaligned biological (DNA or protein) sequences, locate
some recurrent short sequence elements or motifs that are shared among the set
of sequences.
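
As an illustration of the dynamic programming approach, the following is a minimal sketch of the Smith-Waterman recurrence, which scores local alignments between two sequences. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for the example, not values taken from [32]; practical tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```

The table has size (len(a)+1) x (len(b)+1), so both time and memory grow with the product of the sequence lengths, which is why heuristic tools are preferred for database-scale searches.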


Example 24.3: Sequence alignment functions can be used to find similarities 
between two nucleotide sequences. Alignment functions return more biologi- 
cally meaningful results when you are using amino acid sequences. Nucleotide 
sequences are converted to amino acid sequences and the open reading frames 
are identified. For the human and mouse amino acid sequences, the alignment 
is very good between amino acid position 69 and 599. Realigning the trimmed 
sections with high similarity, the percent identity is 84% (446/530). A segment 
of the sequence alignment is shown in Fig. 24.5. 


The protein folding problem is to predict a protein’s three-dimensional struc- 
ture from its one-dimensional amino-acid sequence. If this problem is solved, 
rapid progress will be made in the field of protein engineering and rational drug 
design. NMF in combination with three nearest-neighbor classifiers is explored 
for protein fold recognition in [28]. 

Extracting biologically relevant information from DNA microarrays is a very
important task for drug development and testing, function annotation, and cancer
diagnosis. When analyzing large and heterogeneous collections of gene
expression data, conventional clustering algorithms often cannot produce a
satisfactory solution. Coclustering refers to the simultaneous clustering of
both rows and columns of a data matrix [24]. The goal is to find submatrices,
that is, subgroups of genes and subgroups of conditions, where the genes exhibit
highly correlated activities for every condition. The minimum sum-squared
residue coclustering algorithm [7] is a residue-based coclustering algorithm
that simultaneously identifies coclusters with coherent values in both rows and
columns via an alternating C-means-like iterative algorithm, resulting in a
checkerboard structure. Specific strategies are proposed to enable the algorithm
to escape poor local minima and resolve the degeneracy problem in partitional
clustering algorithms. For microarray data analysis, matrix decomposition
methods such as SVD [40] and NMF [4] are used to detect biclusters in gene
expression profiles.
Data mining


25.1 Introduction


The Web is the world’s largest source of information. It records the real world
from many aspects at every moment. This success is partly due to XML-based
technology, which provides a means of information interchange between
applications, as well as a semistructured data model for integrating information
and knowledge. Information retrieval has enabled the development of useful
web search engines. Relevance criteria based on both textual contents and link 
structure are very useful for effectively retrieving text-rich documents. 

The wealth of information in huge databases or the Web has aroused tremen- 
dous interest in the area of data mining, also known as knowledge discovery in 
databases (KDD). Data mining refers to a variety of techniques in the fields of
databases, machine learning and pattern recognition. The objective is to uncover
useful patterns and associations in large databases. Data mining automatically
searches large stores of data for consistent patterns and/or relationships
between variables so as to predict future behavior. The process of data mining
consists of three phases, namely, data preprocessing and exploration, model
selection and validation, and final deployment. Structured databases have
well-defined features and data mining can easily succeed with good results. Web 
mining is more difficult since the World Wide Web is a less structured database. 

There are three types of web mining in general: web structure mining, web 
usage mining (context mining), and web content mining. Content mining unveils 
useful information about the relationships of web pages based on their content. In 
a similar way context mining unveils useful information about the relationship of 
web pages based on past visitor activity. Context mining is usually applied on the 
access-logs of the web site. Some of the most common data items found in access- 
logs are the IP address of the visitor, the date and time of the access, the time 
zone of the visitor, the size of the data transferred, the URL accessed, the protocol 
used and the access method. The data stored in access-logs is configurable at the 
web server with the items mentioned above appearing in most access-logs. 
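
To make the access-log items concrete, the sketch below parses one log line into the fields listed above. It assumes the widely used Common Log Format; since the logged items are configurable, the regular expression and the field names are illustrative assumptions and would have to be adapted to the actual server configuration.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [day/month/year:time zone] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_log_line(line):
    """Return a dict of access-log fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # The timestamp carries the date, the time, and the time-zone offset.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
rec = parse_access_log_line(line)
print(rec["ip"], rec["method"], rec["url"], rec["size"])
```

Records parsed in this way can then be grouped by visitor address and ordered by time to reconstruct the page sequences that context mining works on.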

Machine learning provides the technical basis of data mining. Data mining 
needs first to discover the structural features in a database, and exploratory 
techniques through self-organization such as clustering are particularly promis- 
ing. Neurofuzzy systems are ideal tools for knowledge representation. Bayesian 
networks provide a consistent framework to model the probabilistic dependencies
among variables. Classification is also a fundamental method in data mining.

Raw data contained in databases typically contain obsolete or redundant
fields, outliers, and invalid values. Data cleaning and data transformation
may be required for data mining. A graphical method for identifying outliers in
numeric variables is to examine a histogram of the variable. Outlier mining can
be used in telecom or credit card fraud detection to detect atypical usage of
telecom services or credit cards, in medical analysis to test abnormal reactions
to new medical therapies, and in marketing and customer segmentation to identify
customers spending much more or much less than the average customer.


Document representations for text categorization 


Document classification first requires transforming text data into numerical
data. This is the vectorization step. The widely used vector space model [99]
for document representation is commonly referred to as the bag-of-words model.
Each document is represented as a feature vector of word frequencies (real
numbers) of the terms (words) that appear in the whole document set. The vector
space model represents a document by a weighted vector of terms. A weight
assigned to a term represents the relative importance of that term. One common
approach for term weighting uses the frequency of occurrence of a particular
word in the document to represent the vector components [99]. Each document can
be represented as a term vector $\boldsymbol{a} = (a_1, a_2, \ldots, a_n)$ in
term space, where each term $a_i$ has an associated weight $w_i$ that denotes
the normalized frequency of the word in the vector space, and $n$ is the number
of term dimensions.

The most widely used weighting approach for term weights is the combination
of term frequency and inverse document frequency (tf-idf) [95]. The inverse
document frequency is the inverse of the number of documents in which a word is
present in the training data set. Thus, less weight is given to words that occur
in a larger number of documents, ensuring that commonly occurring words are
not given undue importance. The weight of term $i$ in document $j$ is defined as

$$w_{ji} = \mathrm{tf}_{ji} \times \mathrm{idf}_i = \mathrm{tf}_{ji} \times \log_2 (N / \mathrm{df}_i), \qquad (25.1)$$

where $\mathrm{tf}_{ji}$ is the number of occurrences of term $i$ in document
$j$, $\mathrm{df}_i$ is the number of documents in the data set that contain
term $i$, and $N$ is the number of documents. tf-idf takes into account the
distribution of the words in the documents, while its variant tfc [57] also
considers the different lengths of the documents. The document score is
based on the occurrence of the query terms in the document. This representation
can use Boolean features indicating whether a specific word occurs in a document
or not. It can also use the absolute frequency of a word (tf), as used in [23].

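
The tf-idf weighting can be sketched in a few lines. The sketch assumes that each document has already been tokenized into a list of words, that the document frequency of a term is the number of documents containing it, and that the logarithm is taken to base 2; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # raw number of occurrences in this document
        weights.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weights

docs = [["gene", "expression", "gene"],
        ["gene", "network"],
        ["neural", "network"],
        ["neural", "gene"]]
w = tf_idf(docs)
print(w[0]["expression"])  # 2.0, since the term occurs once and in 1 of 4 documents
print(w[1]["network"])     # 1.0, since the term occurs once and in 2 of 4 documents
```

The word "gene" occurs twice in the first document but appears in three of the four documents, so its weight (about 0.83) ends up below that of the rarer word "expression".
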
Techniques such as tf-idf vectorize the data easily. However, since each
vectorized word is used to represent the document, the number of dimensions
equals the number of words. Feature reduction can be performed by removing the
stop words (e.g. articles, prepositions, conjunctions and pronouns), and by
mapping words with the same meaning to one morphological form, which is known as
stemming [99]. The Porter algorithm [23] strips common terminating strings
(suffixes) from words in order to reduce them to their roots or stems. Stemming
removes suffixes and prefixes to avoid duplicate entries, and is typically
applied together with the removal of basic stop words.

List all the words (after stemming) in all the training documents sorted by the 
document-frequency (i.e. the number of documents it appears in). Choose the 
top m such words (called keywords) according to this dictionary frequency. For 
binary representation of a specific document, choose the m-dimensional binary 
vector where the ith entry is 1 if the ith keyword appears in the document and 
0 if it does not. For frequency representation, choose an m-dimensional real- 
valued vector, where the ith entry is the normalized frequency of appearance of 
the ith keyword in the specific document. For the tf-idf representation, choose
the m-dimensional real-valued vector whose ith entry is the tf-idf weight of the
ith keyword in the specific document.
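
The dictionary construction and the binary and frequency representations can be sketched as follows. The sketch assumes tokenized and stemmed documents, normalizes frequencies by document length (one possible normalization among several), and breaks ties in document frequency alphabetically so that the result is deterministic; all names are hypothetical.

```python
from collections import Counter

def build_keywords(documents, m):
    """Return the m terms with the highest document frequency."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    # Sort by decreasing document frequency, then alphabetically for ties.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:m]

def binary_vector(tokens, keywords):
    """ith entry is 1 if the ith keyword appears in the document, else 0."""
    present = set(tokens)
    return [1 if k in present else 0 for k in keywords]

def frequency_vector(tokens, keywords):
    """ith entry is the count of the ith keyword divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[k] / total if total else 0.0 for k in keywords]

docs = [["gene", "cluster", "gene"],
        ["gene", "network"],
        ["cluster", "network", "gene"]]
keywords = build_keywords(docs, 2)
print(keywords)                         # ['gene', 'cluster']
print(binary_vector(docs[1], keywords))  # [1, 0]
print(frequency_vector(docs[0], keywords))
```

Restricting the dictionary to m keywords fixes the dimension of every document vector at m, regardless of the vocabulary size.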

Proximity methods in text retrieval provide higher scores to documents that 
contain query terms A and B separated by x terms compared to documents that 
contain query terms A and B separated by y terms, where x < y. The vector 
space model would give the same score to both documents, but it is much faster. 
Phrase-based analysis means that the similarity between documents should be 
based on matching phrases rather than on single words only. 

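A well-known normalization technique is the cosine normalization, in which
the weight $w_i$ of term $i$ is computed as

$$w_i = \frac{\mathrm{tf}_i \cdot \mathrm{idf}_i}{\sqrt{\sum_{k=1}^{n} (\mathrm{tf}_k \cdot \mathrm{idf}_k)^2}}, \qquad (25.2)$$

where $\mathrm{tf}_i$ denotes the term frequency of $a_i$, and $\mathrm{idf}_i$
denotes the inverse document frequency.

Similarity between documents includes the cosine measure. The cosine similarity
between two documents with weight vectors $\boldsymbol{u} = (u_1, \ldots, u_n)$
and $\boldsymbol{v} = (v_1, \ldots, v_n)$ is given by [99]

$$\mathrm{cosine}(\boldsymbol{u}, \boldsymbol{v}) = \frac{\sum_{i=1}^{n} f(u_i) f(v_i)}{\sqrt{\sum_{i=1}^{n} f(u_i)^2} \sqrt{\sum_{i=1}^{n} f(v_i)^2}}, \qquad (25.3)$$

where $f(\cdot)$ is a damping function such as the square root or the logarithmic
function.

The basic assumption is that the similarity between two documents is based
on the ratio of how much they overlap to their union, all in terms of phrases.
The term-based similarity measure is given as [69]

$$\mathrm{sim}_t(\boldsymbol{d}_1, \boldsymbol{d}_2) = \cos(\boldsymbol{d}_1, \boldsymbol{d}_2) = \frac{\boldsymbol{d}_1 \cdot \boldsymbol{d}_2}{\|\boldsymbol{d}_1\| \, \|\boldsymbol{d}_2\|}, \qquad (25.4)$$

where the vectors $\boldsymbol{d}_1$ and $\boldsymbol{d}_2$ represent term
weights calculated using the tf-idf weighting scheme.
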
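
Cosine normalization and the cosine measure with an optional damping function can be sketched as below. The function names are hypothetical, and the identity function is used as the default damping, in which case the measure reduces to the ordinary cosine of the angle between the two weight vectors.

```python
import math

def cosine_normalize(weights):
    """Scale a weight vector to unit Euclidean length (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

def cosine_similarity(u, v, damp=lambda x: x):
    """Cosine measure between u and v after applying the damping function."""
    fu = [damp(x) for x in u]
    fv = [damp(x) for x in v]
    dot = sum(a * b for a, b in zip(fu, fv))
    nu = math.sqrt(sum(a * a for a in fu))
    nv = math.sqrt(sum(b * b for b in fv))
    # Define the similarity as 0 when either vector is all zeros.
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

print(cosine_normalize([3.0, 4.0]))                    # [0.6, 0.8]
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))        # 0.0 (no shared terms)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))        # close to 1.0 (same direction)
print(cosine_similarity([4.0, 1.0], [1.0, 4.0], damp=math.sqrt))
```

Because the measure depends only on the direction of the vectors, two documents with the same term proportions but different lengths receive a similarity close to 1.
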
Spectral document ranking methods employing the DFT [89] use the term
spectra in the document score calculation. They provide higher precision than
other text retrieval methods, such as vector space proximity retrieval methods.
However, spectral document ranking using DFT requires much longer query times.
Spectral document ranking can also be based on DCT [90]. By taking advantage of
the properties of DCT and by employing the vector space model, queries can be
processed as fast as in the vector space model while a much higher precision is
achieved [90].

The simple-minded independent bag-of-words representation remains very
popular. Other more sophisticated techniques for document representation
include those based on higher-order word statistics and string kernels [74].


Neural network approach to data mining 


Classification-based data mining 


Text classification is a supervised learning task for assigning text documents to 
pre-defined classes of documents. It is used to find valuable information from 
a huge collection of text documents available in digital libraries, knowledge 
databases, and the Web. Several characteristics have been observed in vector 
space based methods for text classification [99], including the high dimensional- 
ity of the input space, sparsity of document vectors, linear separability in most 
text classification problems, and the belief that few features are irrelevant. 

Centroid-based classification is one of the simplest classification methods. A 
test document is assigned to a class that has the most similar centroid. Using 
the cosine similarity measure, the original form of centroid-based classification 
finds the nearest centroid and assigns the corresponding class as the predicted 
class. 
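
A minimal sketch of centroid-based classification with the cosine measure is given below. It assumes that documents are already represented as equal-length weight vectors and that the centroid of a class is the arithmetic mean of its training vectors; the function names and the toy data are hypothetical.

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def class_centroids(vectors, labels):
    """Return {label: mean vector of the training documents with that label}."""
    sums, counts = {}, {}
    for x, y in zip(vectors, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(x, centroids):
    """Assign x to the class whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda y: _cosine(x, centroids[y]))

train = [[2.0, 0.0, 1.0], [3.0, 0.0, 0.0], [0.0, 2.0, 1.0], [0.0, 3.0, 0.0]]
labels = ["sports", "sports", "finance", "finance"]
centroids = class_centroids(train, labels)
print(classify([1.0, 0.0, 0.0], centroids))  # sports
print(classify([0.0, 1.0, 1.0], centroids))  # finance
```

Only one centroid per class has to be stored, so classifying a test document costs one similarity computation per class rather than one per training document.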

The centroid, orthogonal centroid, and LDA/GSVD methods are designed for
reducing the dimension of clustered data, and they can reduce the dimension of
the document vectors dramatically [60]. In contrast, LSI/SVD does not attempt to
preserve cluster structure upon dimension reduction. The prediction accuracies
for orthogonal centroid rival those of the full space, even though the dimension
is reduced to the number of clusters.

A representation technique that is based on word-clusters is studied in [7]. Text 
categorization based on this representation can outperform categorization based 
on the bag-of-words representation, although the performance that this method 
achieves may depend on the chosen data set. The information bottleneck cluster- 
ing [7] is applied for generating document representation in a word-cluster space, 
where each cluster is a distribution over document classes. The combination of 
SVM with word-cluster representation is compared with SVM-based categorization
using the simpler bag-of-words representation. When the contribution of
low-frequency words to text categorization is significant (as in the 20
Newsgroups data set), the method based on word clusters significantly
outperforms the word-based representation in terms of categorization accuracy or
representation efficiency. On two other data sets (Reuters-21578 and WebKB) the
word-based representation slightly outperforms the word-cluster representation.

SVMs using the standard word frequencies as features yield good performance
on a number of benchmark problems [57]. String kernels [74] are used to compute
document similarity based on matching non-consecutive subsequences of
characters. In string kernels, the features are the extent to which all possible
ordered subsequences of characters are represented in the document. Text
categorization using string kernels operating at the character level yields
performance comparable to kernels based on the bag-of-words representation.
Furthermore, as gaps within the sequence are allowed, string kernels could also
pick up stems of consecutive words. Examples of sequence kernels are the string
kernel, the syllable kernel, and the word-sequence kernel. The problem of
categorizing documents using kernel-based methods such as SVMs is addressed in
[13], where the technique is applied to sequences of words rather than
characters. This approach is computationally more efficient and it ties in
closely with standard linguistic preprocessing techniques. A kernel method
commonly used in the field of information retrieval is the ranking SVM [58].

The fast condensed nearest-neighbor rule [6] was introduced for computing a 
training-set-consistent subset for the nearest-neighbor decision rule. Condensa- 
tion algorithms for the nearest-neighbor rule can be applied to huge collections 
of data. The method is order-independent, its worst-case time complexity is 
quadratic but often with a small constant prefactor, and it is likely to select 
points very close to the decision boundary. The method outperforms existing 
competence preservation methods both in terms of learning speed and learning 
scaling behavior and, often, in terms of the size of the model while it guarantees 
the same prediction accuracy. It is three orders of magnitude faster than hybrid 
instance-based learning algorithms. 

In a method for spam filtering [118], instead of using keywords, the spamming 
behaviors are analyzed and the representative ones are extracted as features 
for describing the characteristics of e-mails. Spamming behaviors are identified 
according to the information recorded in headers and syslogs of e-mails. An 
enhanced MLP model is considered for two-pass classification: determining the 
spamming behaviors of incoming e-mails and identifying spam according to the 
behavior-based features. Since spamming behaviors are infrequently changed, 
compared with the change frequency of keywords used in spams, behavior-based 
features are more robust with respect to the change of time; thus the behavior- 
based filtering mechanism outperforms keyword-based filtering. 

The notion of privacy-preserving data mining addresses the problem of per- 
forming data analysis on distributed data sources with privacy constraints. Nec- 
essary information is exchanged between several parties to compute aggregate 
results without sharing the private content with one another. A solution is to 
add noise to the source data. BiBoost (bipartite boosting) and MultBoost
(multiparty boosting) [42] allow two or more participants to construct a boosting
classifier without explicitly sharing their data sets. The algorithms inherit the
excellent generalization performance of AdaBoost. They perform better than
AdaBoost executed separately by the participants and, independently of the
number of participants, they perform close to AdaBoost executed using the entire
data set.

Since support vectors are intact tuples taken from the training data set, releas- 
ing an SVM classifier for public use or to clients will disclose the private content 
of support vectors. Privacy-preserving SVM classifier [72] is designed to not dis- 
close the private content of support vectors for the Gaussian kernel function. It 
is robust against adversarial attacks. 

The iJADE web miner for web mining application in the context of Internet 
shopping [66] is based on the integration of neurofuzzy-based web mining tech- 
nology and intelligent visual data mining technology to automate user authenti- 
cation. It provides automatic human-face identification and recognition as well 
as interactive and mobile agent-based product search from a large database. 


Clustering-based data mining 


Clusters provide a structure for organizing a large number of information sources 
for efficient browsing, searching and retrieval. Document clustering is one of 
the most important text mining methods developed to help users effectively 
navigate, summarize and organize text documents. The clustering techniques 
exploit naturally the graph formed by hyperlinks connecting documents to one 
another. Document clustering can be used to browse a collection of documents 
or to organize the results returned by a search engine in response to a user’s 
query. For document clustering, the features are words and the samples are doc- 
uments. Some data-mining approaches that use clustering are database segmen- 
tation, predictive modeling, and visualization of large databases. The topology- 
preserving property for SOM makes it particularly suitable for web information 
processing. 

Latent topic models such as latent semantic indexing (LSI) [32], probabilistic 
latent semantic analysis [110] and latent Dirichlet allocation [8] provide a sta- 
tistical approach to semantically summarize and analyze large-scale document 
collections based on the bag-of-words assumption. Word clusters are called
topics. Latent semantic indexing (LSI) [32] is a spectral clustering method. In
LSI, each document is represented by a histogram of word counts over a
vocabulary of fixed size. The problems of polysemy and synonymy are considered
in [32]. In probabilistic latent semantic analysis [110], the number of
parameters grows linearly with the size of the training data, which makes the
model prone to overfitting. To address this problem, the latent Dirichlet
allocation model [8] introduces priors over the parameters of the probabilistic
latent semantic analysis model.

The latent topic models are extended to semantically learn the underlying 
structure of long time series based on the bag-of-words representation [115]. 
Each time series is treated as a text document, and a set of local patterns from 
the sequence is extracted as words by sliding a short temporal window along the
sequence. 
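
The idea of treating a time series as a document can be sketched as follows. The way a window is turned into a word is an assumption made here for illustration only: each step inside the window is coded as up (u), down (d), or flat (f), which is not necessarily the discretization used in [115].

```python
from collections import Counter

def window_words(series, width, threshold=0.0):
    """Slide a window of the given width along the series; emit one word per window."""
    words = []
    for start in range(len(series) - width + 1):
        window = series[start:start + width]
        symbols = []
        for prev, cur in zip(window, window[1:]):
            diff = cur - prev
            # Hypothetical coding of the local pattern: up, down, or flat.
            symbols.append("u" if diff > threshold else "d" if diff < -threshold else "f")
        words.append("".join(symbols))
    return words

series = [1, 2, 3, 3, 2, 1, 2, 3]
words = window_words(series, 3)
print(words)           # ['uu', 'uf', 'fd', 'dd', 'du', 'uu']
print(Counter(words))  # bag-of-words counts that a topic model could consume
```

The resulting word counts play the same role as term frequencies in a text document, so the latent topic models discussed above can be applied without modification.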

An efficient disk-based implementation of C-means [85] is designed to work 
inside a relational database management system. It can cluster large data sets 
having very high dimensionality. In general, it only requires three scans over the 
data set and one additional one-time run to compute the global mean and covari- 
ance. It is optimized to perform heavy disk I/O and its memory requirements 
are low. Its parameters are easy to set. 

Co-clustering is an approach in which both words and documents are clustered 
at the same time [33]. It is assumed that all documents belonging to a partic- 
ular cluster or category refer to a certain common topic. The topics of genuine 
categories are naturally best described by the titles given to these categories by 
human experts. 

When FCM, fuzzyART, fuzzyART for fuzzy clusters, fuzzy max-min, and
the Kohonen network are applied to document clustering with the bibliographic
database LISA, the best results are obtained with the Kohonen algorithm, which
also organizes the clusters topologically [46]. The testbed database contained documents
of large size and others that were very small and differed very little from one 
another. 

A clustering algorithm proposed in [62] uses supervision in terms of relative
comparisons, viz., x is closer to y than to z. The clustering algorithm simultane- 
ously learns the underlying dissimilarity measure while finding compact clusters 
in the given data set using relative comparisons. The merits of building text 
categorization systems by using supervised clustering techniques are discussed 
in [1]. 

Proximity FCM is an extension of FCM incorporating a measure of similarity 
or dissimilarity as user’s feedback on the clusters during web navigation [75]. The 
algorithm consists of two main phases that are realized in interleaved manner. 
The first phase is primarily FCM applied to the patterns. The second concerns an 
accommodation of the proximity-based hints and involves some gradient-oriented 
learning. Proximity FCM offers a relatively simple way of improving the web page 
classification according to the user interaction with the search engine. 

Clustering problems have been studied for a data stream environment in [47]. 
Clustering continuous data streams allows for the observation of the changes 
of group behavior. It is assumed that at each time instant, data points from
individual streams arrive simultaneously, and the data points are highly
correlated with previous ones in the same stream. The clustering-on-demand
framework [27] dynamically clusters multiple data streams. It realizes online
collection of statistics in a single data scan as well as compact
multiresolution approximations. The framework consists of two phases, namely,
the online maintenance phase and the offline clustering phase. The online
maintenance phase provides an efficient mechanism to maintain summary
hierarchies of data streams with multiple resolutions in time linear in both the
number of streams and the number of data points in each stream. An adaptive
clustering algorithm is devised for the offline phase to retrieve approximations
of desired substreams from summary hierarchies according to clustering queries.

In data intensive peer-to-peer environments, distributed data mining algo- 
rithms that avoid large-scale synchronization or data centralization offer an 
alternate choice. In distributed C-means clustering [29], the data and computing 
resources are distributed over a large peer-to-peer network. Approximate C- 
means clustering without centralizing the data [29] can approximate the results 
of a centralized clustering at reasonable communication cost. In distributed data 
mining, adopting a flat node distribution model can affect scalability. A hierarchi- 
cally distributed peer-to-peer (HP2PC) architecture and a clustering algorithm 
are proposed to address the problem of modularity, flexibility, and scalability 
[49]. 

Integrating data mining algorithms with a relational DBMS is an important
problem. Three SQL implementations of C-means clustering are introduced
to integrate it with a relational DBMS in [86]. The C-means implementations in
SQL and C++ are compared with respect to speed and scalability, and the time to
export data sets outside of the DBMS is also studied in [86]. SQL overhead is
significant for small data sets, but relatively low for large data sets, whereas
export time for running clustering outside the DBMS becomes a bottleneck for
C++, making SQL a more efficient choice.

Document clustering performance can be improved significantly in lower 
dimensional linear subspaces. NMF [119] and concept factorization [120] have 
been applied to document clustering with impressive results. Locally consistent 
concept factorization [11] is an approach to extract the document concepts which 
are consistent with the manifold geometry such that each concept corresponds 
to a connected component. A graph model is used to capture the local geom- 
etry of the document submanifold. By using the graph Laplacian to smooth 
the document-to-concept mapping, locally consistent concept factorization can 
extract concepts with respect to the intrinsic manifold structure and thus docu- 
ments associated with the same concept can be well clustered. 

WordNet [79] is a widely used English ontology. It is an online lexical 
reference system that organizes nouns, verbs, adjectives and adverbs into sets 
of synonymous terms (synsets), which are related to one another by semantic 
relations. It comprises a core ontology and a lexicon. ANNIE is an information 
extraction component of GATE [26]. An unsupervised method [96] uses ANNIE 
together with WordNet lexical categories and the WordNet ontology to create a 
well-structured document vector space whose low dimensionality allows common 
clustering algorithms to perform well. 


Self-organizing maps for data mining 

Current techniques for web log mining utilize the content, such as WEBSOM 
[63], or the context [81] of the web site. The WEBSOM system [63] utilizes 
SOM to cluster web pages based on their content. An SOM-based method [92] 
utilizes both content and context mining clustering techniques to help visitors 
identify relevant information more quickly. The input of the content mining is the set 
of web pages of the web site, whereas the source of the context mining is the 
access-logs of the web site. A scalable parallel implementation of SOM suitable 
for data-mining applications is described in [65]. The parallel algorithm is based 
on the batch SOM formulation. Performance measurements on an SP2 parallel 
computer demonstrate essentially linear speedup for the parallel batch SOM 
algorithm. 

Correspondence analysis is a technique for analyzing the relations existing 
between the modalities of all the qualitative variables by completing a simulta- 
neous projection of the modalities. SOM can be viewed as an extension of PCA 
due to its topology-preserving property. For qualitative variables, SOM has been 
generalized for multiple correspondence analysis [24]. 

SOM is not suitable for analyzing nonvectorial data such as structured data. 
Examples of structured data are temporal sequences such as time 
series, language and words, spatial sequences such as DNA chains, and tree- or 
graph-structured data arising from natural language parsing and from chemistry. 
Some unsupervised self-organizing models for nonvectorial data are temporal 
Kohonen map, recurrent SOM, recursive SOM, SOM for structured data, and 
merge SOM. All these models introduce recurrence into SOM. These models have 
been reviewed and compared in [48, 106]. Time-organized map [116] provides a 
better understanding of the self-organization and geometric structure of corti- 
cal signal representations. The main additional idea of the time-organized map 
compared with SOM is the functionally reasonable transfer of temporal signal 
distances into spatial signal distances in topographic neural representations. 

A generalized SOM model [54] offers an intuitive method of specifying the 
similarity between categorical values via distance hierarchies and, hence, enables 
the direct processing of categorical values during training. In fact, a distance 
hierarchy unifies the distance computation for both numeric and categorical values. 


Bayesian network based data mining 


Bayesian network models were first introduced in information retrieval in [112], 
where index terms, documents and user queries are seen as events and are 
represented as nodes in a Bayesian network. Bayesian networks have also been 
applied to information retrieval problems other than ranking, for example, 
assigning structure to database queries [12], and document clustering and 
classification [39]. 

Bayesian networks provide an effective and flexible framework for modeling 
distinct sources of evidence in support of a ranking. Bayesian networks can be 
used to represent the vector space model and this basic representation can be 
extended to naturally incorporate new evidence from distinct information sources 
[31]. 

Latent Dirichlet allocation [8] is a hierarchical Bayesian model that can infer 
probabilistic topics from the document-word matrix by using the variational 
Bayes method. In [124], the latent Dirichlet allocation model is represented as a 
factor graph, and the loopy belief propagation algorithm is used for approximate 
inference and parameter estimation. 

Automatic subject indexing from a controlled vocabulary [44] and hierarchical 
text classification [101] are difficult for several reasons. Since each descriptor 
in the thesaurus represents a different class/category and a document may be 
associated with several classes, the task is a multi-label problem of high 
dimensionality; there are explicit (hierarchical) relationships between the class 
labels; and the training data can be quite unbalanced across classes. In [30], 
given a document to be classified, a method is 
described that automatically generates an ordered set of appropriate descriptors 
extracted from a thesaurus. The method creates a Bayesian network to model 
the thesaurus and uses probabilistic inference to select the set of descriptors hav- 
ing high posterior probability of being relevant given the available evidence (the 
document to be classified). The model can be used without having preclassified 
training documents. 

An enhanced hybrid classification method is implemented in [55] by combining 
the naive Bayes approach and SVM. The Bayes formula is used to 
vectorize a document according to a probability distribution reflecting the 
probable categories that the document may belong to. The dimensionality is 
thereby reduced from thousands (equal to the number of words in the document 
when using tf-idf) to typically fewer than 20 (the number of categories into 
which the document may be classified), and this probability distribution 
is then fed to SVM for training and classification. 


Personalized search 


Web search engines are built to serve all users. Personalization of web search 
carries out retrieval for each user in a way that incorporates his/her interests. 
One approach to personalized search is to filter or rerank search results by 
checking the content similarity between returned web pages and user profiles. 
User profiles store approximations of user interests. They are either specified 
by users themselves [94], [18] or automatically learnt from a user’s historical 
activities [104]. A user profile is usually structured as a concept/topic 
hierarchy. Existing work builds user profiles in two ways: from topical 
categories [94], [18] or from keyword lists (bags of words) [102], [108], [107]. 

Several approaches represent user interests by using topical categories. User- 
issued queries and user-selected snippets/documents are categorized into concept 
hierarchies that are accumulated to generate a user profile. When the user issues 
a query, each of the returned snippets/documents is also classified. The docu- 
ments are reranked based on how well the document categories match user inter- 
est profiles. In topical-interest-based personalization strategies [37], user profiles 
are automatically learned from users’ past queries and click-throughs in search 
engine logs. Some other personalized search approaches use lists of keywords to 
represent user interests. In [107], user preferences are built as vectors of distinct 
terms and are constructed by aggregating past preferences, including both long- 
term and short-term preferences. Language modeling is used to mine contextual 
information from a short-term search history [102], and to mine context from a 
long-term search history [108]. In [18], keywords are associated with categories, 
and thus, user profiles are represented by a hierarchical category tree based on 
keyword categories. 

Test queries can be divided into three types: clear queries, semiambiguous 
queries, and ambiguous queries. Personalization significantly increases output 
quality for ambiguous and semiambiguous queries; but for clear queries, a com- 
mon web search and current web search ranking might be sufficient, and thus, 
personalization is unnecessary [18]. Queries can also be divided into fresh queries 
and recurring queries. The recent history tends to be much more useful than the 
remote history, especially for fresh queries, whereas the entire history is helpful 
for improving the search accuracy of recurring queries [108]. 

Personalized PageRank is a modification of the PageRank algorithm for per- 
sonalized web search [87]. Multiple personalized PageRank scores, one for each 
main topic of the ODP (Open Directory Project) category hierarchy, are used to 
enable topic-sensitive web search [51]. The HITS algorithm is extended in [109] 
by artificially increasing the authority and hub scores of pages marked relevant 
by the user in previous searches. In personalized search approaches based on 
user groups, the search histories of users whose interests are similar to those 
of a test user are used to refine the search. Collaborative filtering is used to 
construct user profiles, and is a typical group-based personalization method 
used in personalized search in [107]. 

Personalization may be ineffective for queries that show little variation among 
individuals. Click entropy is a simple measure of whether a query should 
be personalized [37]. It measures the variation in user information 
needs for a query $q$ as 

$$\mathrm{ClickEntropy}(q) = -\sum_{p \in \mathcal{P}(q)} P(p|q) \log_2 P(p|q),$$ 

where $\mathcal{P}(q)$ is the set of web pages clicked for query $q$ and $P(p|q)$ 
is the fraction of the clicks for $q$ that fall on page $p$. Click entropy is a 
direct indication of query click variation. If all users click only one identical 
page on query $q$, we have ClickEntropy($q$) = 0. A small click entropy means 
that the majority of users agree with one another on a small number of web 
pages. In such cases, there is no need to do any personalization. A large click 
entropy indicates that many web pages were clicked for the query. In this case, 
personalization can help to filter the pages that are more relevant to each user 
by making use of historical selections, and thus to provide different web pages 
to different users. A method that incorporates click histories of a group of 
users with similar topical affinities to personalize web search is given in [37]. 
A large-scale evaluation framework is presented for personalized search based on 
query logs, and five personalized search algorithms (including two click-based 
ones and three topical-interest-based ones) are evaluated using 12-day query 
logs of Windows Live Search. It is found that no personalization algorithm can 
outperform the others for all queries [37]. 
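
As an illustration, click entropy can be estimated from a click log. The following is a minimal sketch, assuming the standard Shannon form with base-2 logarithm and assuming that the log for one query is simply a list of clicked page identifiers; the function name and the data layout are hypothetical and are not taken from [37].

```python
import math
from collections import Counter


def click_entropy(clicked_pages):
    """Shannon entropy (in bits) of the clicks recorded for one query.

    clicked_pages: list of page identifiers, one entry per click
    (an assumed log format, for illustration only).
    """
    counts = Counter(clicked_pages)
    total = sum(counts.values())
    # P(p|q) is estimated as the fraction of clicks that fall on page p.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# All users click the same page: no variation, entropy is zero.
same_page = click_entropy(["p1", "p1", "p1", "p1"])
# Clicks split evenly over two pages: one bit of entropy.
two_pages = click_entropy(["p1", "p2", "p1", "p2"])
```

A query whose entropy is near zero would be left unpersonalized, whereas a query with a large entropy is a candidate for personalization; the threshold between the two is a design choice that this sketch does not fix.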

Web search ranking requires labeled data. The labels usually come in the 
form of relevance assessments made by editors. The relevance labels of the 
training examples could change over time. Click logs embed important information 
about user satisfaction with a search engine and can provide a highly valuable 
source of relevance information. Compared to editorial labels, clicks are much 
cheaper to obtain and always reflect current relevance. Click logs can thus serve 
as an important source of implicit feedback and as a cheap proxy for 
editorial labels. Clicks have been used in multiple ways by search engines: to 
tune search parameters, to evaluate different ranking functions [14], [58], or as 
signals to directly influence ranking [58]. However, clicks are known to be biased 
by the presentation order, the appearance (e.g. title and abstract) of the 
documents, and the reputation of individual sites. In [14], the relationship 
between clicks and relevance is modeled so that clicks can be used to evaluate a 
search engine without bias when editorial relevance judgments are lacking. A 
dynamic Bayesian network proposed in [16] provides an unbiased estimate of the 
relevance from the click logs. 

The collective feedback of the users of an information retrieval system provides 
semantic information that can be useful in web mining tasks. A richer data 
structure is used to preserve most of the information available in the web logs [35]. 
This data structure consists of three groups of entities, namely users, documents 
and queries, which are connected in a network of relations. Query refinements 
correspond to separate transitions between the corresponding query nodes in 
the graph, while users are linked to the queries they have issued and to the 
documents they have selected. The classical query/document transitions, which 
connect a query to the documents selected by the users in the returned result 
page, are also considered. 

A method proposed in [121] extracts concepts from users’ browsed docu- 
ments to create hierarchical concept profiles for personalized search in a privacy- 
enhanced environment. It assumes that the system knows the documents that 
the user is interested in, instead of using clickthrough. A query expansion method 
[25] is proposed based on user interactions recorded in the clickthrough data. The 
method focuses on mining correlations between query terms and document terms 
by analyzing users’ clickthroughs. Document terms that are strongly related to 
the input query are used together to narrow down the search. In [67], 
clickthrough data are used to estimate a user’s conceptual preferences, and 
personalized query suggestions are then provided for each individual user according to his/her 
conceptual needs. The motivation is that queries submitted to a search engine 
may have multiple meanings. Clickthrough data are exploited in the personal- 
ized clustering process to identify user preferences. To resolve the disadvantage 
of keyword-based clustering methods, clickthrough data has been used to clus- 
ter queries based on common clicks on URLs. One major problem with the 
clickthrough-based method is that the number of common clicks on URLs for 
different queries is limited. Thus, the chance for the users to see the same results 
would be small. Search queries are usually short and ambiguous, and thus are 
insufficient for specifying the precise user needs. Some search engines suggest 
terms that are semantically related to the submitted queries so that users can 
choose from the suggestions the ones that reflect their information needs. 

In [94], users’ profiles are learned from their surfing histories and documents 
returned by a metasearch engine are reranked/filtered based on the profiles. In 
[73], a user profile and a general profile are learned from the user’s search history 
and a category hierarchy, respectively. The two profiles are combined to map a 
user query into a set of categories which represent the user’s search intention and 
serve as a context to disambiguate the words in the user’s query. Web pages are 
retrieved by merging multiple lists of web pages from multiple query submissions. 
Web search is conducted based on both the user query and the set of categories. 


XML format 


Extensible markup language (XML) has been recognized as a standard data 
representation for interoperability over the Internet. Data in XML documents are 
self-describing. A huge amount of information is formatted in XML documents. 
Decomposing the XML documents and storing them in relational tables is a 
popular practice. Similar to the popular hypertext markup language (HTML), 
XML is flexible in organizing data based on so-called nested tags. Tags in HTML 
associated with data express the presentation style of data, while tags in XML 
describe the semantics of data. The hierarchy formed by nested tags structures 
the content of XML documents. The role of nested tags in XML is similar to that 
of schemas in relational databases. At the same time, the nested XML model is 
far more flexible than the flat relational model. The basic data model of XML is 
a labeled and ordered tree. 

XML queries concern not only the content but also the structure of XML 
data. Basically, the queries can be formed using twig patterns, in which nodes 
represent the content part and edges the structural part of the queries. XML 
queries are categorized into two classes [45]: database-style queries and informa- 
tion retrieval style queries. Database-style queries return all query results that 
precisely match the queries. Commercial database systems are mainly relational 
database management systems (RDBMSs), and examples include IBM DB2, 
Microsoft SQL Server, and Oracle DB. Information retrieval style queries allow 
imprecise or fuzzy query results, which are ranked based on their relevance to 
the queries. Only the top-ranked results are returned to users, which is similar to 
the semantics of keyword search queries in the traditional information retrieval 
context. 

XML Path language (XPath) and XQuery are mainstream (database-style) 
XML query languages. Twig patterns play a very important role in XPath and 
XQuery. In XML, each document is associated with a document type definition (DTD), 
which contains information describing the document structure. Large web sites 
are becoming repositories of structured information that can benefit from being 
viewed and queried as relational databases. 

One important problem in XML query processing is twig pattern matching, 
that is, finding all matches in an XML data tree that satisfy a specified twig (or 
path) query pattern. Major techniques for twig pattern matching are reviewed, 
classified and compared in [45]. The relational approach and the native approach 
are two classes of major XML query processing techniques. A good trade-off 
between the two approaches would be storing XML data in the form of inverted 
lists by using existing relational databases, coupled with integrating efficient 
native join algorithms for XML twig queries into existing relational query opti- 
mizers [45]. 
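
A small twig query can make the idea concrete. The following is a minimal sketch that uses the limited XPath subset of Python's standard xml.etree.ElementTree module; the sample document, its tag names and the function name are invented for illustration. The pattern has a branching node (book) with two children: author acts as a structural filter and title is the node that is returned.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a labeled, ordered tree.
DOC = """
<library>
  <shelf>
    <book><title>Neural Networks</title><author>Du</author></book>
    <book><title>Anonymous Notes</title></book>
  </shelf>
</library>
"""


def match_twig(xml_text):
    """Return the titles of all book elements that have an author child.

    The twig pattern is: book at any depth, with branches author and title.
    """
    root = ET.fromstring(xml_text)
    # ".//book" walks to descendants, "[author]" keeps only books that
    # have an author child, "/title" selects the other branch.
    return [t.text for t in root.findall(".//book[author]/title")]


titles = match_twig(DOC)
```

This is a database-style query in the sense used above: every match is returned and no ranking is applied. A native XML engine would evaluate the same pattern with specialized join algorithms rather than by walking the tree.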

XML documents can be compared as to their structural similarity, in order 
to group them into clusters so that different storage, retrieval, and processing 
techniques can be effectively exploited. Compared to standard methods based on 
graph-matching algorithms, the technique presented in [41] for detecting struc- 
tural similarity between XML documents is based on the idea of representing 
an XML document as a time series of a numerical sequence. Thus the structural 
similarity between two documents can be computed by exploiting the DFT of the 
associated signals, allowing a significant reduction of the required computation 
costs. 

The semantic web adds metadata and ontology information to web pages to 
make the web easier to exploit by humans and, especially, by programs. 
The paradigm of the semantic web treats metadata as a largely untapped 
source for enhancing intelligent information management. RDF 
(resource description framework) has become the standard language for 
representing the semantic web. It describes a semantic web using statements that 
are triples of the form (subject, property, object). Subjects are resources that 
are uniquely identified by a uniform resource identifier (URI). In terms of 
semantic richness for multimedia information, the MPEG-7 Structured Annotations 
Description Scheme is among the most comprehensive and powerful, and has 
been applied to image annotation. 


Web usage mining 


Web usage mining consists in adapting data mining methods to access log file 
records. These files collect data such as the IP address of the connected machine, 
the requested URL, the date and other information regarding the navigation of 
the user. Web usage mining techniques provide knowledge about the behaviour 
of users in order to extract relationships from the recorded data. Sequential 
patterns are particularly suited to the study of logs. The access log file is first 
sorted by address and by transaction. Then all uninteresting data is pruned out 
from the file. During the sorting process, URLs and clients can be mapped to 
integers. Each time and date is also translated into a relative time with respect 
to the earliest time in the log file. The structure of a log file is close to the 
client-time-item structure used by sequential pattern algorithms. 
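
The preprocessing steps described above can be sketched as follows. This is only an illustrative sketch: it assumes that each log record has already been parsed into a tuple (IP address, timestamp in seconds, URL) and that uninteresting records have already been pruned; the function name and the record layout are hypothetical.

```python
def preprocess_log(records):
    """Turn parsed access-log records into client-time-item triples.

    records: non-empty list of (ip, timestamp_seconds, url) tuples
    (an assumed format, for illustration only).
    """
    # Sort by address, then by time within each address.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    # Times are expressed relative to the earliest time in the log.
    t0 = min(r[1] for r in ordered)
    client_ids, url_ids = {}, {}
    triples = []
    for ip, ts, url in ordered:
        # Map each client and each URL to a small integer on first sight.
        client = client_ids.setdefault(ip, len(client_ids))
        item = url_ids.setdefault(url, len(url_ids))
        triples.append((client, ts - t0, item))
    return triples, client_ids, url_ids


log = [
    ("10.0.0.2", 1000, "/a"),
    ("10.0.0.1", 1005, "/b"),
    ("10.0.0.1", 1001, "/a"),
]
triples, clients, urls = preprocess_log(log)
```

The resulting triples have the client-time-item form and can be fed to a sequential pattern algorithm.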

In [122], clustering is used to segment user sessions into clusters or profiles that 
can later form the basis for personalization. Web utilization miner [105] discovers 
navigation patterns with user-specified characteristics over an aggregated materi- 
alized view of the web log, consisting of a tree of sequences of web views. Popular 
clustering approaches build clusters based on usage patterns derived from users’ 
page preferences. To discover similarities in users’ accessing behavior 
with respect to the time locality of their navigational acts, two time-aware 
clustering approaches define clusters of users that show similar visiting behavior 
in the same time period, by varying the priority given to the pages visited or 
to the time of the visits [91]. 

In [83], a complete framework and findings in mining web usage patterns are 
presented from web log files of a real web site that has all the challenging aspects 
of real-life web usage mining, including evolving user profiles and external data 
describing an ontology of the web content, as well as an approach for discovering 
and tracking evolving user profiles. The discovered user profiles can be enriched 
with explicit information need that is inferred from search queries extracted from 
web log data [83]. 

A specific data mining process is proposed in [78] in order to reveal the densest 
periods automatically. The approach is able to extract both frequent sequential 
patterns and the associated dense periods. 

Caching is a strategy for improving the performance of web-based systems. 
The heart of a caching system is its page replacement policy, which selects the 
pages to be replaced in a cache when a request arrives. A web-log mining method 
[123] caches web objects and a prediction algorithm predicts future web requests 
to improve the system performance. 


Association mining 


Associations are affinities between items. Association rule mining has been 
applied to analyze market baskets, helping managers realize which items are 
likely to be bought at the same time [70]. A well-known technique for discovering 
association rules from databases is the Apriori algorithm [3]. While association 
rules ignore ordering among the items, an Apriori variation respecting (tempo- 
ral) ordering emerged under the name sequence mining [4]. The link analysis 
technique mines relationships and discovers knowledge. Association discovery 
algorithms find combinations where the presence of one item suggests the pres- 
ence of another. Some algorithms can find association rules from nominal data 
[3]. 
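
The level-wise idea behind Apriori can be sketched in a few lines. The following is a simplified illustration and not the optimized algorithm of [3]: it only finds frequent item sets (rule generation is omitted), it recounts supports by scanning all transactions, and the function name, the use of an absolute count threshold and the sample baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {item set: support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if count(frozenset([i])) >= min_count}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = count(s)
        k += 1
        # Join step: merge frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates if count(c) >= min_count}
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(baskets, min_count=3)
```

With a threshold of three baskets, the only frequent pair in this toy data is {bread, milk}; all other pairs occur in only two baskets and are discarded, so no triples are generated.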

Given a user keyword query, current web search engines return a list of 
individual web pages ranked by their goodness with respect to the query. Thus, the 
basic unit for search and retrieval is an individual page, even though 
information on a topic is often spread across multiple pages. This degrades the 
quality of search results, especially for long or uncorrelated (multitopic) 
queries, where a single page is unlikely to satisfy the user’s information need. 
Given a keyword query, composed pages, which contain all query keywords, can be 
generated on the fly by extracting and stitching together relevant pieces from 
hyperlinked web pages and retaining links to the original web pages [113]. To 
rank the composed pages, both the hyperlink structure of the original pages and 
the associations between the keywords within each page are considered. 

Given a time stamped transaction database and a user-defined reference 
sequence of interest over time, similarity-profiled temporal association mining 
discovers all associated item sets whose prevalence variations over time are sim- 
ilar to the reference sequence. The similar temporal association patterns can 
reveal interesting relationships of data items which co-occur with a particular 
event over time. Most works in temporal association mining have focused on cap- 
turing special temporal regulation patterns such as cyclic patterns and calendar 
scheme-based patterns. 


Ranking search results 


In a traditional search engine like Google, a query is specified by giving a set of 
keywords, possibly linked through logic operators and enriched with additional 
constraints (e.g., document type and language). Traditional search engines do 
not have the necessary infrastructure for exploiting relation-based information 
that belongs to the semantic annotations for a web page. PageRank [87], [10] 
and hyperlink-induced topic search (HITS) [61] are widely applied to analyze 
the structure of the web. 

A simple count of the number of links to a page does not take into account 
the fact that not all citations have the same authority. PageRank, used 
in the Google web search engine, ranks the results very effectively. PageRank is 
a topology-based ranking criterion. The authority of a page is computed 
recursively as a function of the authorities of the pages that link to the target page. 
At query time, these importance scores are used in conjunction with query- 
specific information retrieval scores to rank the query results. PageRank has a 
clear efficiency advantage over the HITS algorithm, as the query-time cost of 
incorporating the precomputed PageRank importance score for a page is low. 
Furthermore, as PageRank is generated using the entire web graph, rather than 
a small subset, it is less susceptible to localized link spam. 

HITS estimates the authority and hub values of hyperlinked pages on the web, 
while PageRank merely ranks pages. For HITS, the authority is a measure of the 
page relevance as an information source, and the hubness refers to the quality of a 
page as a link to authoritative resources. Documents with high authority scores 
have many links pointing to them. Documents with high hub scores point to 
many authoritative sites. The HITS scheme is query-dependent. User queries are 
issued to a search engine in order to create a set of seed pages. The web is then 
crawled forward and backward from that seed set to mirror the portion of the web 
containing the information that is likely to be useful. A ranking criterion based 
on topological analyses can then be applied to the pages belonging to the selected 
web portion. 
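
The mutual reinforcement between hubs and authorities can be illustrated by the usual iterative update, in which the authority of a page is the sum of the hub scores of the pages pointing to it, and the hub score of a page is the sum of the authorities of the pages it points to. The following is a minimal sketch under stated assumptions: the graph is given as adjacency lists over the already selected web portion, scores are rescaled to unit Euclidean length after each step, and a fixed number of iterations is used; the function name and the toy graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """links[j] is the list of pages that page j points to."""
    n = len(links)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority update: collect hub scores over incoming links.
        auth = [0.0] * n
        for j, outs in enumerate(links):
            for i in outs:
                auth[i] += hub[j]
        norm = math.sqrt(sum(x * x for x in auth)) or 1.0
        auth = [x / norm for x in auth]
        # Hub update: collect authority scores over outgoing links.
        hub = [sum(auth[i] for i in outs) for outs in links]
        norm = math.sqrt(sum(x * x for x in hub)) or 1.0
        hub = [x / norm for x in hub]
    return hub, auth


# Pages 0 and 1 both point to pages 2 and 3.
hub, auth = hits([[2, 3], [2, 3], [], []])
```

In this toy graph pages 0 and 1 come out as pure hubs and pages 2 and 3 as pure authorities, with equal scores within each pair.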

In the semantic web, each page possesses semantic metadata that record addi- 
tional details concerning the web page itself. Annotations are based on classes of 
concepts and relations among them. The vocabulary for the annotation is usu- 
ally expressed by means of an ontology that provides a common understanding of 
terms within a given domain. Semantic search engines are capable of exploiting 
concepts (and relations) hidden behind each keyword together with natural lan- 
guage interpretation techniques to further refine the result set. A relation-based 
PageRank algorithm is used in conjunction with semantic web search engines 
that simply rely on information that could be extracted from user queries and 
on annotated resources [64]. Relevance is measured as the probability that a 
retrieved resource actually contains those relations whose existence was assumed 
by the user at the time of query definition. 


Surfer models 


Surfer models model a surfer who browses the Internet. There are a variety of 
surfer models, such as random surfer [10], HITS [61], directed surfer [97], and 
topic-sensitive PageRank [52]. The random surfer model assumes that the surfer 
is browsing web pages at random by either following a link from the current page 
chosen uniformly at random or by typing its URL. The directed surfer model 
assumes that, when the surfer is at any page, he jumps to only one of those 
pages that is relevant to the context, the probability of which is proportional 
to the relevance of each outlink. Both models guarantee the convergence of this 
stochastic process to a stationary distribution under mild assumptions like the 
irreducibility of the transition probability matrix. The computation of PageRank 
can be modeled by a single-surfer random walk by choosing a surfer model based 
on only two actions: the surfer jumps to a new random page with probability 
$1 - d$ or follows one link from the current page with probability $d$. 

The directed surfer model encompasses the models which allow only forward 
walks. A surfer probabilistically chooses the next page to be visited depending 
on the content of the page and the query terms he is looking for [97]. A general 
probabilistic framework is proposed for web page scoring systems in [34]. The 
general web page scoring model extends both PageRank and HITS. A method- 
ology that simultaneously performs page ranking and context extraction is dealt 
with in [88] based on the principle of surfer models. A scalable and convergent 
iterative procedure is provided for its implementation. 

The PageRank algorithm for improving the ranking of search-query results 
computes a single vector, using the link structure of the web, to capture the rel- 
ative importance of web pages, independent of any particular search query. To 
yield more accurate search results, a set of PageRank vectors, biased using a set 
of representative topics, are computed to capture more accurately the notion of 
importance with respect to a particular topic [52]. By using linear combinations 
of these (precomputed) biased PageRank vectors to generate context-specific 
importance scores for pages at query time, more accurate rankings can be gen- 
erated than with a single, generic PageRank vector. 
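
The query-time combination step can be sketched as follows. This is only an illustration: it assumes that the topic-biased PageRank vectors have been precomputed and that a weight for each topic (for example, an estimate of how likely the query belongs to that topic) is available from some classifier; the function name, the topic names and all numbers are invented and are not taken from [52].

```python
def combine_topic_ranks(topic_ranks, topic_weights):
    """Weighted linear combination of precomputed topic-biased rank vectors.

    topic_ranks:   dict mapping topic -> list of per-page scores
    topic_weights: dict mapping topic -> nonnegative weight for this query
    """
    total = sum(topic_weights.values())
    n_pages = len(next(iter(topic_ranks.values())))
    scores = [0.0] * n_pages
    for topic, weight in topic_weights.items():
        for page, rank in enumerate(topic_ranks[topic]):
            # Weights are normalized so that they sum to one.
            scores[page] += (weight / total) * rank
    return scores


ranks = {
    "sports": [0.6, 0.3, 0.1],
    "science": [0.1, 0.2, 0.7],
}
# A query judged to be mostly about sports.
scores = combine_topic_ranks(ranks, {"sports": 0.75, "science": 0.25})
```

Because only a weighted sum of stored vectors is needed, the cost at query time stays low even though the biased vectors themselves are expensive to compute offline.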

Based on HITS, an entropy-based analysis mechanism is proposed in [59] for 
analyzing the entropy of anchor texts and links to eliminate the redundancy of 
the hyperlinked structure so that the complex structure of a web site can be dis- 
tilled. However, to increase the value and the accessibility of pages, most of the 
content sites tend to publish their pages with intrasite redundant information, 
such as navigation panels, advertisements, copyright announcements, etc. To further 
eliminate such redundancy, another mechanism InfoDiscoverer [59] applies the 
distilled structure to identify sets of article pages. InfoDiscoverer employs the 
entropy information to analyze the information measures of article sets and to 
extract informative content blocks from these sets. On the average, the aug- 
mented entropy-based analysis leads to prominent performance improvement. 


PageRank Algorithm 


The PageRank algorithm used by the Google search engine is an unsupervised 
learning method. The main idea is to determine the importance of a web page 
in terms of the importance assigned to the pages hyperlinking to it. The web is 
viewed as a directed graph of pages connected by hyperlinks. A random surfer 
starts from an arbitrary page and keeps clicking on successive links at random, 
visiting from page to page. 

The PageRank value of a page corresponds to the relative frequency with which the
random surfer visits that page, assuming that the surfer goes on infinitely. The more
time spent by the random surfer on a page, the higher the PageRank importance 
of the page. The PageRank algorithm considers a web page to be important if 
many other web pages point to it. It takes into account both the importance 
(PageRank) of the linking pages and the number of outgoing links that they 
have. Linking pages with higher PageRank are given more weight, while pages 
with more outgoing links are given less weight. The algorithm precomputes a 
rank vector that provides a priori importance estimates for all of the pages on 
the web. This vector is computed once, offline, and is independent of the search 
query. At query time, the precomputed value is looked up and combined with
other strategies to rank the pages.

Let $l_{ij} = 1$ if page $j$ points to page $i$, and $l_{ij} = 0$ otherwise. The number of
pages pointed to by page $j$ (number of outlinks) is denoted by $c_j = \sum_{i=1}^{N} l_{ij}$,
where $N$ is the number of pages. The PageRanks $p_i$ are defined by the recursive
relation

$p_i = (1 - \alpha) + \alpha \sum_{j=1}^{N} \frac{l_{ij}}{c_j} p_j$,   (25.5)

where the damping factor $\alpha$ is a positive constant (0.85). The importance of page
$i$ is the sum of the importance of pages that point to that page. The summands are
weighted by $1/c_j$, that is, each page distributes a total vote of 1 to other pages.
The constant $\alpha$ ensures that each page gets a PageRank of at least $1 - \alpha$. The
first term handles pages with no forward links (dangling pages): the random
surfer escapes from a dangling page by jumping to a randomly chosen page.
The surfer can also avoid getting trapped in a bucket of the web graph, which is a
reachable strongly connected component without outgoing edges toward the rest
of the graph.

In matrix form

$p = (1 - \alpha) \mathbf{1} + \alpha L D_c^{-1} p$,   (25.6)

where $\mathbf{1}$ is a vector of $N$ ones and $D_c = \mathrm{diag}(c)$ is a diagonal matrix with diagonal
elements $c_j$. Assuming that the average PageRank is 1, that is, $\mathbf{1}^T p = N$, we have

$p = [(1 - \alpha) \mathbf{1} \mathbf{1}^T / N + \alpha L D_c^{-1}] p = A p$,   (25.7)

where the matrix $A$ is the expression in square brackets.

The PageRank vector can be found by using the power method. Viewing
PageRank as a Markov chain, $A$ has a real eigenvalue equal to unity, and this is
its largest eigenvalue. We can find the desired PageRanks $p$ by the power method

$p_k \leftarrow A p_{k-1}$,  $p_k \leftarrow N p_k / (\mathbf{1}^T p_k)$,   (25.8)

starting with $p = p_0$. Since $A$ has positive entries with each column summing
to unity, Markov chain theory tells us that it has a unique eigenvector with
eigenvalue one, corresponding to the stationary distribution of the chain.

The system is interpreted as follows: with probability $\alpha$ the random surfer moves
forward by following the links, and with the complementary probability $1 - \alpha$
the surfer gets bored of following the links and enters a new destination in the
browser's URL line, possibly unrelated to the current page. Setting $\alpha = 0.85$
implies that after about five link clicks the random surfer chooses a random
page.
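
The power iteration for PageRank can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pagerank`, the adjacency-list input format, the fixed iteration count, and the uniform spreading of the vote of a dangling page are choices made here and are not taken from the text.

```python
def pagerank(links, alpha=0.85, iters=100):
    """Power iteration for PageRank, scaled so that the scores sum to N.

    links[j] is the list of pages that page j points to (pages are 0..N-1).
    """
    n = len(links)
    p = [1.0] * n                      # p_0: average PageRank equal to 1
    for _ in range(iters):
        new = [1.0 - alpha] * n        # teleportation term (1 - alpha)
        for j, out in enumerate(links):
            if out:
                share = alpha * p[j] / len(out)   # alpha * p_j / c_j
                for i in out:
                    new[i] += share
            else:
                # Assumption: a dangling page spreads its vote uniformly.
                share = alpha * p[j] / n
                for i in range(n):
                    new[i] += share
        total = sum(new)
        p = [n * v / total for v in new]   # renormalize so that 1^T p = N
    return p


# Hypothetical 4-page web: 0 -> 1, 2; 1 -> 2; 2 -> 0; 3 -> 2.
scores = pagerank([[1, 2], [2], [0], [2]])
```

In this toy graph, page 3 receives no links, so its score settles at the minimum value $1 - \alpha$, while page 2, which collects links from all other pages, obtains the largest score.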


Example 25.1: An example is shown in Figure 25.1. Each node is labeled with
its PageRank score. Scores are normalized to sum to 100, and $\alpha = 0.85$. Page A is
a dangling node, while pages C and D form a bucket. Page C receives only one
link but from the most important page D, and its importance is high. Page F
receives many more links, but from anonymous pages, and its importance is low
compared to that of page C. Pages G, H, I, and J do not receive endorsements,
and thus their scores correspond to the minimum amount of status of each page.

Figure 25.1 A PageRank instance with solution.


PageRank does not consider the lengths of time that the web surfer spends on 
the pages during the browsing process. The staying time information can be used
as a good indicator of the importance of the pages. BrowseRank [71] leverages the
user staying time information on web pages for page importance computation. 
It collects the user behavior data in web surfing and builds a user browsing 
graph, which contains both user transition information and user staying time 
information. A continuous-time Markov process is employed in BrowseRank to 
model the browsing behaviors of a web surfer, and the stationary distribution of 
the process is regarded as the page importance scores. 

Under the general Markov framework [43], a web Markov skeleton process is 
used to model the random walk conducted by the web surfer on a given graph. 
Page importance is then defined as the product of two factors: page reachabil- 
ity, the average possibility that the surfer arrives at the page, and page utility, 
the average value that the page gives to the surfer in a single visit. These two 
factors can be computed as the stationary probability distribution of the corre- 
sponding embedded Markov chain and the mean staying time on each page of 
the web Markov skeleton process, respectively. This general framework can cover 
PageRank and BrowseRank as special cases. 

PageRank favors pages with many in-links. Older pages may have accumulated 
many in-links. It is thus difficult to find the latest information on the web using 
PageRank. Time-sensitive ranking (TS-Rank) [68] extends PageRank by using a 
function of time in the damping factor d in PageRank. 

Some PageRank-inspired bibliometric indicators to evaluate the importance of
journals using the academic citation network have been proposed and extensively
tested: journal PageRank [9], Eigenfactor (http://www.eigenfactor.org), and
SCImago (http://www.scimagojr.com). GeneRank [82] is a modification of
PageRank for using connectivity data to produce a prioritization of the genes in
a microarray experiment.


Hypertext induced topic search (HITS)


Let $L = (l_{ij})$ be the adjacency matrix of the web graph, i.e., $l_{ij} = 1$ if page $i$
links to page $j$ and $l_{ij} = 0$ otherwise. Note that this convention is the transpose
of the one used for $l_{ij}$ in (25.5). HITS defines a pair of recursive equations:

$x^{(k)} = L^T y^{(k-1)}$,  $y^{(k)} = L x^{(k)}$,   (25.9)

where $x$ is the authority vector containing the authority scores and $y$ is the hub
vector containing the hub scores, $k \geq 1$ and $y^{(0)} = \mathbf{1}$, the vector of all ones. The
first equation tells us that authoritative pages are those pointed to by good hub
pages, while the second equation claims that good hubs are pages that point to
authoritative pages.

Notice that (25.9) is equivalent to

$x^{(k)} = L^T L x^{(k-1)}$,  $y^{(k)} = L L^T y^{(k-1)}$.   (25.10)

It follows that the authority vector $x$ is the dominant right eigenvector of the
authority matrix $A = L^T L$, and the hub vector $y$ is the dominant right eigenvector
of the hub matrix $H = L L^T$. This is very similar to the PageRank method.
To compute the dominant eigenpairs the power method can be exploited. While 
the convergence of the power method is guaranteed, the computed solution is 
not necessarily unique, since the authority and hub matrices are not necessarily 
irreducible. A modification similar to the teleportation trick used for the PageR- 
ank method can be applied to recover the uniqueness of the solution [125]. HITS 
is related to SVD [36]. It follows that the HITS authority and hub vectors cor- 
respond, respectively, to the right- and left-singular vectors associated with the 
highest singular value of the adjacency matrix L. 
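
The HITS recursion (25.9) can be sketched as a plain power iteration. The function name `hits`, the dense list-of-lists adjacency matrix, the fixed iteration count, and the rescaling of both vectors to unit sum after each step are assumptions made for this illustration; the sketch also assumes that the graph contains at least one link, so that the sums are nonzero.

```python
def hits(L, iters=50):
    """Iterate x = L^T y, y = L x, where L[i][j] = 1 if page i links to page j."""
    n = len(L)
    y = [1.0] * n                                  # y^(0) = vector of all ones
    x = [0.0] * n
    for _ in range(iters):
        # Authority update: x_j = sum_i L[i][j] * y_i
        x = [sum(L[i][j] * y[i] for i in range(n)) for j in range(n)]
        # Hub update: y_i = sum_j L[i][j] * x_j
        y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Rescaling (an addition here) keeps the numbers bounded.
        sx, sy = sum(x), sum(y)
        x = [v / sx for v in x]
        y = [v / sy for v in y]
    return x, y


# Hypothetical 3-page graph: pages 0 and 1 both link to page 2.
authority, hub = hits([[0, 0, 1], [0, 0, 1], [0, 0, 0]])
```

Here page 2 is the only authority, and pages 0 and 1 share the hub score equally.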

An advantage of HITS with respect to PageRank is that it provides two rank- 
ings: the most authoritative pages and the most hubby pages. HITS has a higher 
susceptibility to spamming: while it is difficult to add incoming links to a favorite 
page, the addition of outgoing links is much easier. This leads to the possibility 
of purposely inflating the hub score of a page, indirectly influencing also the 
authority scores of the pointed pages. 


Data warehousing 


Data warehousing is a paradigm specifically intended to provide vital strategic
information [93]. A data warehouse is a repository that integrates information
from multiple data sources, which may or may not be heterogeneous, and
makes it available for decision support querying and analysis. Materialized
views collect data from databases into the warehouse, but without copying each
database into the warehouse. Queries on the warehouse can then be answered
using the views instead of accessing the remote databases. When data are
modified on the remote databases, the modifications are transmitted to the
warehouse. The architecture of a typical data warehouse is shown in Fig. 25.2.

Figure 25.2 Architecture of a typical data warehouse.


Business intelligence, in the context of the data warehouse, is the ability of an 
enterprise to study past behaviors and actions in order to understand its past, 
determine its current situation, and predict or change what will happen in the 
future. Data marts are a subset of data warehouse data and are where most 
of the analytical activities in the business intelligence environment take place. 
The data in each data mart is usually tailored for a particular function, such as 
product profitability analysis and customer demographic analysis. Data mining 
fits well and plays a significant role in the data warehouse environment. 

The data warehouse provides the best opportunity for analysis, and online 
analytical processing (OLAP) is the vehicle for carrying out involved analysis. 
OLAP helps the user to analyze the past and gain insights, while data mining 
helps the user to predict the future. OLAP tools conceptually model the informa- 
tion as multidimensional cubes. The cubes in a data warehouse can be stored by 
following either a relational OLAP and/or a multidimensional OLAP approach. 
In relational OLAP, the data are stored in relational tables. 

SQL (structured query language) is the interface for retrieving and manipulat- 
ing data from relational databases. These methods are used in data warehouse 
environments. A data warehouse stores materialized views of data from one or 
more sources, for the purpose of implementing decision-support or OLAP queries. 

Database integrity is a very important motivation for studying temporal 
dependencies. Temporal constraints can take different forms. They can be 
expressed using first-order temporal logic [20], or using temporal dependencies 
[117], which are restricted classes of first-order temporal logic formulas. A tem- 
poral dependency, called trend dependency, captures a significant family of data 
evolution regularities (constraints). An example of such regularity is “Salaries of 
employees generally do not decrease”. Trend dependencies compare attributes
over time using operators from $\{<, =, >, \leq, \geq, \neq\}$, allowing us to express
meaningful trends. Trend dependency mining is investigated in [117] for a temporal
database.




Entities may have two or more representations in databases. We distinguish 
between two types of data heterogeneity: structural and lexical. Structural het- 
erogeneity occurs when the fields of the tuples in the database are structured 
differently in different databases. Lexical heterogeneity occurs when the tuples 
have identically structured fields across databases, but the data use different 
representations to refer to the same real-world object. Duplicate records do not 
share a common key and/or they contain errors that make duplicate matching a 
difficult task. Duplicate record detection is reviewed in [40], and multiple tech- 
niques are described for improving the efficiency and scalability of approximate 
duplicate detection algorithms. 

In uncertain data management, data records are typically represented by prob- 
ability distributions rather than deterministic values. A survey of uncertain data 
mining and management applications is provided in [2]. 


Content-based image retrieval 


Image retrieval is either content-based or text-based. Early
approaches to image retrieval were text-based, and most search
engines return images solely based on the text of the pages from which the 
images are linked. In text-based retrieval, some form of textual description of 
the image contents is assumed to be stored with the image itself by some form of 
HTML tagged text. However, since images on the web are poorly labeled, stan- 
dard keyword-based image searching techniques frequently yield poor results. 
Image-based features provide either alternative or additional signals for image 
retrieval. 

In content-based image retrieval (CBIR) [15], [98], [103], [56], image character- 
istics, such as color, shape or texture, are used for indexing and searching. CBIR 
systems adopt query by example as the primary query model where the user 
query is specified through an example image or features of the example image, 
which is then compared to the images in the database. The images retrieved are 
ranked according to a distance metric from the query. Multiple exemplars can be 
used for capturing user’s retrieval needs, and ranked results of individual queries 
using each image separately can be combined. 
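
Query by example can be illustrated with a minimal ranking sketch. The function name `rank_by_distance`, the dictionary of feature vectors, the toy color histograms, and the choice of the Euclidean distance are all hypothetical; an actual CBIR system may use other features and other distance metrics.

```python
import math


def rank_by_distance(query, database):
    """Return image names sorted by Euclidean distance to the query features."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return sorted(database, key=lambda name: dist(query, database[name]))


# Hypothetical 3-bin color histograms for three images.
db = {
    "sunset": [0.8, 0.1, 0.1],
    "forest": [0.1, 0.8, 0.1],
    "beach": [0.6, 0.2, 0.2],
}
ranking = rank_by_distance([0.75, 0.15, 0.1], db)
```

The image whose feature vector is closest to that of the example image is returned first.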

The low-level visual features (color, texture, shape, etc.) are automatically 
extracted to represent the images. Most current CBIR systems are region-based. 
Global feature based retrieval is comparatively simpler. However, the low-level 
features may not accurately characterize the high-level semantic concepts. In 
CBIR, understanding the user’s needs is a challenging task. Relevance feedback is 
an effective tool for taking the user’s judgement into account [98]. With the user- 
provided negative and positive feedbacks, image retrieval can then be thought 
of as a classification problem. 

In [56], the image-ranking problem is cast into the task of identifying authority
nodes on an inferred visual similarity graph, and VisualRank is proposed
to analyze the visual link structures among images. The images found to be
authorities are chosen as those that answer the image-queries well. VisualRank
is an end-to-end system to improve Google image search results with emphasis
on robust and efficient computation of image similarities applicable to a large
number of queries and images. VisualRank employs the random walk intuition
to rank images based on the visual hyperlinks among the images. It incorporates
the advances made in using link and network analysis for web document search
into image search.

On the Internet are many digital documents in scanned image format. 
Although the technology of document image processing may be utilized to auto- 
matically convert the digital images of these documents to the machine-readable 
text format using optical character recognition (OCR) technology, it is not a 
cost-effective and practical way to process a huge number of paper documents. 
A document image processing system needs to analyze different text areas in a 
page document, understand the relationship among these text areas, and then 
convert them to a machine-readable version using OCR, in which each character 
object is assigned to a certain class. A document image retrieval system [80] 
provides an answer of yes or no with respect to the user’s query, rather than the 
exact recognition of a character/word as in document image processing. Words, 
rather than characters, are the basic units of meaning in information retrieval. 
Therefore, directly matching word images in a document image is an alternative 
way to retrieve information from the document. An approach with the capabil- 
ity of matching partial word images [76] addresses word spotting and similarity 
measurement between documents. 

Approaches to CBIR typically extract a single signature from each image based 
on color, texture or shape features. The images returned as the query result are 
the ones whose signatures are closest to the signature of the query image. While 
efficient for simple images, such methods do not work well for complex scenes 
since they fail to retrieve images that match the query only partially, that is, 
only certain regions of the image match. WALRUS (WAveLet-based Retrieval 
of User-specified Scenes) [84] is a similarity retrieval algorithm that is robust 
to scaling and translation of objects within an image. It employs a similarity 
model in which each image is first decomposed into its regions and the similarity 
measure between a pair of images is then defined to be the fraction of the area of 
the two images covered by matching regions from the images. In order to extract 
regions for an image, WALRUS considers sliding windows of varying sizes and 
then clusters them based on the proximity of their signatures. WALRUS builds 
a set of a variable number of signatures for an image, one signature per image 
region. 

The aim of image auto-annotation is to find a group of keywords $w^*$ that maximizes
the conditional distribution $p(w|I_q)$, where $I_q$ is the uncaptioned query
image and $w$ are terms or phrases in the vocabulary. An attempt at model-free
image annotation is a data-driven approach that annotates images by mining
their search results [114]. The search process first discovers visually and
semantically similar search results, the mining process then identifies salient terms
from textual descriptions of the search results, and the annotation rejection
process filters out the noisy terms yielded. Since no training data set is required, the
approach enables annotating with unlimited vocabulary and is highly scalable
and robust to outliers.

In [22], multiple sources of evidence related to the images are considered. To
allow combining these distinct sources of evidence, an image retrieval model
is introduced based on Bayesian belief networks. Retrieval using the text
passages surrounding an image is as effective as standard retrieval based on HTML
tags. A general method using kernel CCA [50] is presented to learn a semantic
representation of web images and their associated text. It has established a
general approach to retrieving images based solely on their content.


Content-based music retrieval 

Music is one of the most popular types of online information on the web. Some 
of the huge music collections available have posed a major challenge for search- 
ing, retrieving, and organizing music content. Music search uses content-based 
methods. 

Low-level audio features are measurements of audio signals that contain
information about a musical work and music performance. In general, low-level audio
features are segmented in three different ways: frame-based segmentations
(periodic sampling at 10 ms to 1000 ms intervals), beat-synchronous segmentations
(features are aligned to musical beat boundaries), and statistical measures that
construct probability distributions out of features (bag-of-features models). Many
low-level audio features are based on the short-time spectrum of the audio signal.

Music genre is probably the most popular description of music content. Most 
of the music genre classification algorithms resort to the so-called bag-of-features 
(BOF) approach [100], which models the audio signals by the long-term statis- 
tical distribution of their short-time spectral features. These features can be 
roughly classified into three classes (i.e., timbral texture features, rhythmic fea- 
tures, and pitch content features). Genres are next classified from the feature 
vectors extracted. 

Music identification is based on audio fingerprinting techniques, which aim at
describing digital recordings with a compact set of acoustic features. An
alternative approach to music identification is audio watermarking. In this case, research
on psychoacoustics is exploited to embed a watermark in a digital recording
without altering sound perception. Similarly to fingerprints, audio watermarks should
be robust to distortions, additional noise, D/A and A/D conversions, and lossy
compression.


Email anti-spamming


Fraud detection, intrusion detection and medical diagnosis are recognized as 
anomaly detection problems. Anomaly detection is to find objects that are dif- 
ferent from most other objects. The class imbalance problem is thus intrinsic to 
the anomaly detection applications. 

The e-mail spam problem continues to grow drastically. Various methods on 
near-duplicate spam detection have been developed [28]. Based on an analysis of 
e-mail content text, this problem is modelled as a binary text classification task. 
Representatives of this category are the naive Bayes [53] and SVM [38] methods. 

To achieve small storage size and efficient matching, prior works mainly
represent each e-mail by a succinct abstraction derived from the e-mail content text.
Moreover, hash-based text representation is applied extensively. A common attack on
this type of representation is to insert a random normal paragraph without any
suspicious keywords into an unobvious position of an e-mail. In such a context, if the
whole e-mail content is utilized for hash-based representation, the near-duplicate
part of spams cannot be captured. Hash-based text representation is also not suitable
for all languages. Furthermore, images and hyperlinks, which are important clues
for spam detection, cannot be included in hash-based text representation.

Noncontent information such as e-mail header, e-mail social network [19], and 
e-mail traffic [21] is exploited to filter spams. Collecting notorious and innocent 
sender addresses (or IP addresses) from e-mail header to create black list and 
white list is a commonly applied method initially. MailRank [19] examines the 
feasibility of rating sender addresses with the PageRank algorithm in the e-mail 
social network. However, e-mail header can be altered by spammers to conceal 
the identity. 

Regarding collaborative spam filtering with near-duplicate similarity matching 
scheme, peer-to-peer-based architecture [28] and centralized server based system 
are generally employed. The primary idea of the similarity matching scheme for 
spam detection is to maintain a known spam database, formed by user feedback, 
to block subsequent near-duplicate spams. To achieve efficient similarity
matching and reduce storage utilization, prior works mainly represent
each e-mail by a succinct abstraction derived from e-mail content text. An e-mail
abstraction scheme proposed in [111] considers e-mail layout structure to repre- 
sent e-mails. The designed complete spam detection system Cosdes possesses an 
efficient near-duplicate matching scheme and a progressive update scheme. 

Detection of malicious web sites from the lexical and host-based features of 
their URLs is explored in [77]. Online algorithms not only process large numbers 
of URLs more efficiently than batch algorithms, they also adapt more quickly 
to new features in the continuously evolving distribution of malicious URLs. A 
real-time system for gathering URL features is developed and is paired with a
real-time feed of labeled URLs from a large web mail provider.


25.1 How is data mining different from OLAP? Explain briefly.


25.2 As a data mining consultant, you are hired by a large commercial bank 
that provides many financial services. The bank has a data warehouse. The 
management wants to find the existing customers who are most likely to respond 
to a marketing campaign offering new services. Outline the knowledge discovery 
process. 


25.3 Consider a table of linked web pages: 


Page    Link to page
A       B, D, E, F
B       C, D, E, F
C       B, E, F
D       A, B, F
E       A, B, C
F       A, C


(a) Find the authorities and hubs of HITS by using two iterations. 
(b) Find the PageRank scores for each page after one iteration using 0.25 as the
damping factor.


25.4 Consider the PageRank algorithm. 

(a) Show that from definition (25.5) the sum of the PageRanks is N, the number 
of web pages. 

(b) Write a program to compute the PageRank solutions by the power method 
using formulation (25.8). Apply it to the network of Fig. 25.1. 


25.5 Explain why PageRank can effectively fight spam.


Appendix A: Mathematical preliminaries


In this Appendix, we provide mathematical preliminaries that are used in the 
preceding chapters. 


Linear algebra 
Pseudoinverse 


Definition A.1 (Pseudoinverse). The pseudoinverse $A^\dagger$, also called the
Moore-Penrose generalized inverse, of a matrix $A \in \mathbb{R}^{m \times n}$ is the unique matrix
that satisfies

$A A^\dagger A = A$,   (A.11)
$A^\dagger A A^\dagger = A^\dagger$,   (A.12)
$(A A^\dagger)^T = A A^\dagger$,   (A.13)
$(A^\dagger A)^T = A^\dagger A$.   (A.14)

$A^\dagger$ can be calculated by

$A^\dagger = (A^T A)^{-1} A^T$   (A.15)

if $A^T A$ is nonsingular, and

$A^\dagger = A^T (A A^T)^{-1}$   (A.16)

if $A A^T$ is nonsingular. The pseudoinverse is directly associated with the linear LS
problem.

When $A$ is a square nonsingular matrix, the pseudoinverse $A^\dagger$ reduces to its inverse
$A^{-1}$. For a scalar $a$, $a^\dagger = a^{-1}$ for $a \neq 0$, and $a^\dagger = 0$ for $a = 0$.

For the $n \times n$ identity matrix $I$ and an $n \times n$ singular matrix $J$, namely, $\det(J) = 0$,
for $a \neq 0$ and $a + nb \neq 0$, we have [6]

$(aI + bJ)^{-1} = \frac{1}{a} \left( I - \frac{b}{a + nb} J \right)$.   (A.17)

This identity holds when $J^2 = nJ$, as is the case when $J$ is the all-ones matrix.


Linear least-squares problems
The linear LS or $L_2$-norm problem is basic to many signal processing techniques.
It tries to solve a set of linear equations, written in matrix form


$Ax = b$,   (A.18)

where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, and $b \in \mathbb{R}^m$.
This problem can be converted into the minimization of the squared error
function

$E(x) = \frac{1}{2} \|Ax - b\|_2^2 = \frac{1}{2} (Ax - b)^T (Ax - b)$.   (A.19)

The solution corresponds to one of the following three situations [5]:

• rank(A) = n = m. We get a unique exact solution

$x^* = A^\dagger b$   (A.20)

and $E(x^*) = 0$.

• rank(A) = n < m. The system is overdetermined, and has no exact solution.
There is a unique solution in the least-squares error sense

$x^* = A^\dagger b$,   (A.21)

where $A^\dagger = (A^T A)^{-1} A^T$. In this case,

$E(x^*) = \frac{1}{2} b^T (I - A A^\dagger) b \geq 0$.   (A.22)

• rank(A) = m < n. The system is underdetermined, and the solution is not
unique. But the solution with the minimum $L_2$-norm $\|x\|_2$ is unique

$x^* = A^\dagger b$.   (A.23)

Here $A^\dagger = A^T (A A^T)^{-1}$. We have $E(x^*) = 0$ and $\|x^*\|_2^2 = b^T (A A^T)^{-1} b$.
Vector norms 
Definition A.2 (Vector norms). A norm acts as a measure of distance. A
vector norm on $\mathbb{R}^n$ is a mapping $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies the following properties: for
any $x, y \in \mathbb{R}^n$, $a \in \mathbb{R}$,

• $f(x) \geq 0$, and $f(x) = 0$ iff $x = \mathbf{0}$.
• $f(x + y) \leq f(x) + f(y)$.
• $f(ax) = |a| f(x)$.

The mapping is denoted as $f(x) = \|x\|$.


The p-norm or $L_p$-norm is a popular class of vector norms

$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$   (A.24)

with $p \geq 1$. The $L_1$, $L_2$ and $L_\infty$ norms are more useful:

$\|x\|_1 = \sum_{i=1}^{n} |x_i|$,   (A.25)
$\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} = (x^T x)^{1/2}$,   (A.26)
$\|x\|_\infty = \max_{1 \leq i \leq n} |x_i|$.   (A.27)

The $L_2$-norm is the popular Euclidean norm.
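
As a small numerical illustration of (A.24)-(A.27), the following sketch evaluates the three norms for one vector. The function name `p_norm` and the use of `float("inf")` to select the maximum norm are conventions chosen here for illustration.

```python
def p_norm(x, p):
    """L_p norm of a sequence x for p >= 1; p = float('inf') gives the max norm."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, -4.0]
l1 = p_norm(x, 1)                 # |3| + |-4|
l2 = p_norm(x, 2)                 # sqrt(9 + 16)
linf = p_norm(x, float("inf"))    # max(|3|, |-4|)
```

For this vector the three values are 7, 5, and 4, respectively, which illustrates the ordering $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$.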

A matrix $Q \in \mathbb{R}^{m \times m}$ is called an orthogonal matrix or unitary matrix if
$Q^T Q = I$. The $L_2$-norm is invariant under orthogonal transforms, that is, for
all orthogonal $Q$ of appropriate dimensions

$\|Qx\|_2 = \|x\|_2$.   (A.28)


Matrix norms 
A matrix norm is a generalization of the vector norm by extending from $\mathbb{R}^n$ to
$\mathbb{R}^{m \times n}$. For a matrix $A = [a_{ij}]_{m \times n}$, the most frequently used matrix norms are
the Frobenius norm

$\|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2}$,   (A.29)

and the matrix p-norm

$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p} = \max_{\|x\|_p = 1} \|Ax\|_p$,   (A.30)

where sup is the supremum operation.

The matrix 2-norm and the Frobenius norm are invariant with respect to
orthogonal transforms, that is, for all orthogonal $Q_1$ and $Q_2$ of appropriate
dimensions

$\|Q_1 A Q_2\|_F = \|A\|_F$,   (A.31)
$\|Q_1 A Q_2\|_2 = \|A\|_2$.   (A.32)


Eigenvalue decomposition 
Definition A.3 (Eigenvalue decomposition). Given a square matrix
$A \in \mathbb{R}^{n \times n}$, if there exists a scalar $\lambda$ and a nonzero vector $v$ such that

$Av = \lambda v$,   (A.33)

then $\lambda$ and $v$ are, respectively, called an eigenvalue of $A$ and its corresponding
eigenvector. All the eigenvalues $\lambda_i$, $i = 1, \ldots, n$, can be obtained by solving the
characteristic equation

$\det(A - \lambda I) = 0$,   (A.34)

where $I$ is an $n \times n$ identity matrix. The set of all the eigenvalues is called the
spectrum of $A$.


If $A$ is nonsingular, $\lambda_i \neq 0$. If $A$ is symmetric, then all $\lambda_i$ are real. The
maximum and minimum eigenvalues satisfy the Rayleigh quotient

$\lambda_{\max}(A) = \max_{v \neq 0} \frac{v^T A v}{v^T v}$,  $\lambda_{\min}(A) = \min_{v \neq 0} \frac{v^T A v}{v^T v}$.   (A.35)

The trace of a matrix is equal to the sum of all its eigenvalues and the
determinant of a matrix is equal to the product of its eigenvalues

$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$,   (A.36)
$|A| = \prod_{i=1}^{n} \lambda_i$.   (A.37)


Singular value decomposition 
Definition A.4 (Singular value decomposition). For a matrix $A \in \mathbb{R}^{m \times n}$,
there exist real unitary matrices $U = [u_1, u_2, \ldots, u_m] \in \mathbb{R}^{m \times m}$ and
$V = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{n \times n}$ such that

$U^T A V = \Sigma$,   (A.38)

where $\Sigma \in \mathbb{R}^{m \times n}$ is a real pseudodiagonal $m \times n$ matrix with $\sigma_i$, $i = 1, \ldots, p$,
$p = \min(m, n)$, $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_p \geq 0$, on the diagonal and zeros off the diagonal.

The $\sigma_i$'s are called the singular values of $A$, and $u_i$ and $v_i$ are, respectively, called the
left singular vector and right singular vector for $\sigma_i$. They satisfy the relations

$A v_i = \sigma_i u_i$,  $A^T u_i = \sigma_i v_i$.   (A.39)

Accordingly, $A$ can be written as

$A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$,   (A.40)

where $r$ is the index of the smallest nonzero singular value, that is, the number
of nonzero singular values. In the special case when $A$ is a symmetric
non-negative definite matrix, $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, where
$\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ are the real eigenvalues of $A$, $v_i$
being the corresponding eigenvectors.

SVD is useful in many situations. The rank of $A$ can be determined by the
number of nonzero singular values. The power of $A$ can be easily calculated by

$A^k = U \Sigma^k V^T$,   (A.41)

where $k$ is a positive integer; this expression holds when $U = V$, as for a
symmetric non-negative definite $A$. SVD is extensively applied in linear inverse
problems. The pseudoinverse of $A$ can then be described by

$A^\dagger = V_r \Sigma_r^{-1} U_r^T$,   (A.42)

where $V_r$, $\Sigma_r$ and $U_r$ are the matrix partitions corresponding to the $r$ nonzero
singular values.
The Frobenius norm can thus be calculated as

$\|A\|_F = \left( \sum_{i=1}^{p} \sigma_i^2 \right)^{1/2}$,   (A.43)

and the matrix 2-norm is calculated by

$\|A\|_2 = \sigma_1$.   (A.44)


SVD requires a time complexity of O(mn min{m, n}) for a dense m x n matrix. 
Common methods for computing the SVD of a matrix are standard eigensolvers 
such as QR iteration and Arnoldi/Lanczos iteration. 


QR decomposition 
For the full-rank or overdetermined linear LS case, $m \geq n$, (A.18) can also be
solved by using the QR decomposition procedure.

$A$ is first factorized as

$A = QR$,   (A.45)

where $Q$ is an $m \times m$ orthogonal matrix, that is, $Q^T Q = I$, and
$R = \begin{bmatrix} \bar{R} \\ 0 \end{bmatrix}$ is an $m \times n$ upper triangular matrix with $\bar{R} \in \mathbb{R}^{n \times n}$.
Inserting (A.45) into (A.18) and premultiplying by $Q^T$, we have

$Rx = Q^T b$.   (A.46)

Denoting $Q^T b = \begin{bmatrix} \bar{b} \\ \tilde{b} \end{bmatrix}$, where $\bar{b} \in \mathbb{R}^n$ and $\tilde{b} \in \mathbb{R}^{m-n}$, we have

$\bar{R} x = \bar{b}$.   (A.47)

Since $\bar{R}$ is a triangular matrix, $x$ can be easily solved using backward
substitution. This is the approach used in the GSO procedure.
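
The QR route to the LS solution can be sketched as follows, assuming a full-column-rank matrix. The sketch uses the modified Gram-Schmidt variant and the economy-size factorization (only the first $n$ columns of $Q$ are formed, which is all that is needed for the upper $n \times n$ triangular system); the function name `qr_least_squares` and the line-fitting data are hypothetical.

```python
def qr_least_squares(A, b):
    """Solve min ||Ax - b||_2 for full-column-rank A via Gram-Schmidt QR."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]   # columns of A
    R = [[0.0] * n for _ in range(n)]
    Q = []                                                   # orthonormal columns
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Modified Gram-Schmidt: project the current residual on q_k.
            R[k][j] = sum(Q[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        R[j][j] = sum(t * t for t in v) ** 0.5
        Q.append([t / R[j][j] for t in v])
    # First n components of Q^T b.
    b_bar = [sum(Q[j][i] * b[i] for i in range(m)) for j in range(n)]
    # Backward substitution on the upper triangular system R x = b_bar.
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        s = sum(R[j][k] * x[k] for k in range(j + 1, n))
        x[j] = (b_bar[j] - s) / R[j][j]
    return x


# Fit y = c0 + c1 * t to the points (0, 1), (1, 3), (2, 5), (3, 7).
coeffs = qr_least_squares([[1, 0], [1, 1], [1, 2], [1, 3]], [1, 3, 5, 7])
```

Since the four points lie exactly on the line $y = 1 + 2t$, the computed coefficients are close to $(1, 2)$ and the residual is zero up to rounding.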

When rank(A) < n, the rank-deficient LS problem has an infinite number of
solutions, and QR decomposition does not necessarily produce an orthonormal basis
for range(A) = $\{y \in \mathbb{R}^m : y = Ax \text{ for some } x \in \mathbb{R}^n\}$. QR-cp can be applied to
produce an orthonormal basis for range(A).

As a basic method for computing SVD, QR decomposition itself can be computed
by means of the Givens rotation, the Householder transform, or GSO.


Condition numbers
Definition A.5 (Condition number). The condition number of a matrix
$A \in \mathbb{R}^{m \times n}$ is defined by

$\mathrm{cond}_p(A) = \|A\|_p \|A^\dagger\|_p$,   (A.48)

where $p$ can be selected as 1, 2, $\infty$, Frobenius, or any other norm.


The relation, cond(A) > 1, always holds. Matrices with small condition num- 
bers are well conditioned, while matrices with large condition number are poorly 
conditioned or ill-conditioned. The condition number is especially useful in 
numerical computation, where ill-conditioned matrices are sensitive to round- 
ing errors. 

For the L2-norm, 


cond2(A) = z, (A.49) 
Pp 


where p = min(m, n). 
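
As a small numerical illustration (a sketch assuming NumPy and a full-rank example matrix chosen here for convenience), the 2-norm condition number can be obtained either from the ratio of the extreme singular values or from the product of the norms of A and its pseudoinverse:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

s = np.linalg.svd(A, compute_uv=False)  # singular values, largest first
cond_from_svd = s[0] / s[-1]            # sigma_1 / sigma_p with p = min(m, n)

# Product of the 2-norms of A and its pseudoinverse.
cond_from_def = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.pinv(A), 2)
```

Both values should agree with `np.linalg.cond(A)`, which uses the 2-norm by default.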


Householder reflections and Givens rotations 
Orthogonal transforms play an important role in matrix computations such as EVD, SVD, and QR decomposition. The Householder reflection, also termed the Householder transform, and the Givens rotation, also called the Givens transform, are two basic operations in the orthogonalization process. These operations are easily constructed, and they introduce zeros in a vector so as to simplify matrix computations. The Householder reflection is exceedingly efficient for annihilating all but the first entry of a vector, while the Givens rotation is more effective for zeroing a single specified entry of a vector.

Let $v \in \mathbb{R}^n$ be nonzero. The Householder reflection is defined as a rank-one modification to the identity matrix

$$P = I - 2\frac{v v^T}{v^T v}. \tag{A.50}$$

The Householder matrix $P \in \mathbb{R}^{n \times n}$ is symmetric and orthogonal. v is called a Householder vector. The Householder transform of a matrix A is given by PA. By specifying the form of the transformed matrix, one can find a suitable Householder vector v. For example, one can define a Householder vector as $v = u - \alpha e_1$, where $u \in \mathbb{R}^n$ is an arbitrary nonzero vector of length $|\alpha|$ and $e_1 \in \mathbb{R}^n$ has unity as its first entry, all the other entries being zero. In this case, $Pu = \alpha e_1$ becomes a vector with only the first entry nonzero.

The Givens rotation $G(i,k,\theta)$ is a rank-two correction to the identity matrix I. It modifies I by setting the (i, i)th entry as $\cos\theta$, the (i, k)th entry as $\sin\theta$, the (k, i)th entry as $-\sin\theta$, and the (k, k)th entry as $\cos\theta$. With this sign convention, premultiplying a vector x by $G(i,k,\theta)^T$ applies a counterclockwise rotation of $\theta$ radians in the (i, k) coordinate plane. One can require a chosen entry of a vector to become zero under the Givens rotation, and then calculate the corresponding rotation angle $\theta$.
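
Both operations are easy to try out numerically. The sketch below assumes NumPy; the function names, the 0-based indices, the example vectors, and the sign choice for the scalar `alpha` (taken opposite to the first entry of u to avoid cancellation) are illustrative assumptions.

```python
import numpy as np

def householder_matrix(v):
    """P = I - 2 v v^T / (v^T v) for a nonzero Householder vector v."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / (v @ v)

def givens_matrix(n, i, k, theta):
    """Identity with entries (i,i)=c, (i,k)=s, (k,i)=-s, (k,k)=c (0-based i, k)."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[i, k] = c, s
    G[k, i], G[k, k] = -s, c
    return G

# Householder: annihilate all but the first entry of u.
u = np.array([3.0, 4.0, 0.0, 12.0])           # length 13
alpha = -np.sign(u[0]) * np.linalg.norm(u)    # |alpha| equals the length of u
v = u.copy()
v[0] -= alpha                                 # v = u - alpha * e_1
P = householder_matrix(v)
Pu = P @ u                                    # only the first entry is nonzero

# Givens: zero entry k of x using entry i.
x = np.array([3.0, 0.0, 4.0])
i, k = 0, 2
theta = np.arctan2(x[k], x[i])                # angle that makes (G x)[k] vanish
G = givens_matrix(3, i, k, theta)
Gx = G @ x
```

Here the angle is obtained from the requirement that entry k of `G @ x` be zero, which gives tan(theta) = x[k] / x[i].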


Matrix inversion lemma 

The matrix inversion lemma is also called the Sherman-Morrison-Woodbury formula. It is useful in deriving many iterative algorithms. Assume that the relationship between the matrix $A \in \mathbb{R}^{n \times n}$ at iterations t and t + 1 is given as

$$A(t+1) = A(t) + \Delta A(t). \tag{A.51}$$

If $\Delta A(t)$ can be expressed as $UV^T$, where $U \in \mathbb{R}^{n \times m}$ and $V \in \mathbb{R}^{n \times m}$, it is referred to as a rank-m update. The matrix inversion lemma gives [5]

$$A^{-1}(t+1) = A^{-1}(t) - \Delta A^{-1}(t) = A^{-1}(t) - A^{-1}(t) U \left( I + V^T A^{-1}(t) U \right)^{-1} V^T A^{-1}(t), \tag{A.52}$$

where both A(t) and $I + V^T A^{-1}(t) U$ are assumed to be nonsingular. Thus, a rank-m correction to a matrix results in a rank-m correction to its inverse.

Some modifications to the formula are available, and one popular update is given here. Let A and B be two positive-definite matrices related by

$$A = B^{-1} + C D^{-1} C^T, \tag{A.53}$$

where C and D are also matrices of compatible dimensions, with D positive definite. The matrix inversion lemma gives the inverse of A as

$$A^{-1} = B - BC\left(D + C^T B C\right)^{-1} C^T B. \tag{A.54}$$
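
A quick numerical check of both forms of the lemma is sketched below, assuming NumPy. The matrices are randomly generated for illustration, and the second check assumes the relation is written as A = B^{-1} + C D^{-1} C^T with D positive definite, so that the inverse involves (D + C^T B C)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2

# Rank-m update: A(t+1) = A(t) + U V^T, with U and V of size n x m.
A = 4.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = 0.5 * rng.standard_normal((n, m))
V = 0.5 * rng.standard_normal((n, m))
A_inv = np.linalg.inv(A)
core = np.linalg.inv(np.eye(m) + V.T @ A_inv @ U)   # only an m x m inverse is needed
A_next_inv = A_inv - A_inv @ U @ core @ V.T @ A_inv

# Positive-definite variant: A2 = B^{-1} + C D^{-1} C^T.
M = rng.standard_normal((n, n))
B = M @ M.T + np.eye(n)          # positive definite by construction
D = np.diag([2.0, 3.0])          # positive definite
C = rng.standard_normal((n, m))
A2 = np.linalg.inv(B) + C @ np.linalg.inv(D) @ C.T
A2_inv = B - B @ C @ np.linalg.inv(D + C.T @ B @ C) @ C.T @ B
```

The practical benefit is that, once the inverse of A(t) is known, updating it requires inverting only an m x m matrix rather than the full n x n matrix.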


Partial least squares regression 

Partial LS regression [9] is extensively used in high-dimensional and severely 
underconstrained domains. It is a robust, iterative method that avoids matrix 
inversion for underconstrained data sets by decomposing the multivariate regres- 
sion problem into successive univariate regressions. Partial LS iteratively chooses 
its projection directions according to the direction of maximum correlation 
between the (current residual) input and the output. Computation of each pro- 
jection direction is O(d) for d dimensions of the data. Successive iterations create 
orthogonal projection directions by removing the subspace of the input data used 
in the previous projection. The number of projection directions found by partial LS is bounded only by the dimensionality of the data, with each univariate regression on successive projection components further reducing the residual error.
Using all d projections leads to ordinary LS regression. If the distribution of 
the input data is spherical, then partial LS requires only a single projection to 
optimally reconstruct the output. 
Linear scaling and data whitening
By linear normalization, all the raw data can be brought into the vicinity of an average value. For a one-dimensional data set, $\{x_i \mid i = 1, \ldots, N\}$, the mean $\hat{\mu}$ and variance $\hat{\sigma}^2$ are estimated by

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{A.55}$$

$$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_i - \hat{\mu} \right)^2. \tag{A.56}$$

The transformed data are then defined by

$$\tilde{x}_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}. \tag{A.57}$$

The transformed data set $\{\tilde{x}_i \mid i = 1, \ldots, N\}$ has zero mean and unit standard deviation.
When the raw data set $\{x_i \mid i = 1, \ldots, N\}$ is composed of vectors, accordingly, the mean vector $\hat{\mu}$ and covariance matrix $\hat{\Sigma}$ are calculated by

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{A.58}$$

$$\hat{\Sigma} = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_i - \hat{\mu} \right) \left( x_i - \hat{\mu} \right)^T. \tag{A.59}$$

Equations (A.56) and (A.59) are, respectively, the unbiased estimates of the variance and the covariance matrix. When the factor $\frac{1}{N-1}$ is replaced by $\frac{1}{N}$, the estimates of the variance and the covariance matrix are the ML estimates. The ML estimates for variance and covariance are biased.

New input vectors can be defined by the linear transformation

$$\tilde{x}_i = \Lambda^{-1/2} U^T \left( x_i - \hat{\mu} \right), \tag{A.60}$$

where $U = [u_1, \ldots, u_M]$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_M)$, M is the dimension of the data vectors, and $\lambda_i$ and $u_i$ are the eigenvalues and the corresponding eigenvectors of $\hat{\Sigma}$, which satisfy

$$\hat{\Sigma} u_i = \lambda_i u_i. \tag{A.61}$$

The new data set $\{\tilde{x}_i\}$ has zero mean and its covariance matrix is the identity matrix [1]. The above process is also called data whitening.
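
The whitening transform can be sketched in a few lines. This is a minimal illustration assuming NumPy, samples stored as rows of an N x M array, and a covariance matrix with strictly positive eigenvalues; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def whiten(X):
    """Whiten the rows of X (N samples, M features) via the covariance eigen-decomposition."""
    mu = X.mean(axis=0)                   # sample mean vector
    Sigma = np.cov(X, rowvar=False)       # unbiased covariance estimate, factor 1/(N-1)
    lam, U = np.linalg.eigh(Sigma)        # eigenvalues lam and eigenvectors (columns of U)
    # Each row becomes Lambda^{-1/2} U^T (x_i - mu); assumes all eigenvalues are positive.
    X_white = (X - mu) @ U / np.sqrt(lam)
    return X_white, mu, Sigma

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.standard_normal((500, 3)) @ mixing + np.array([1.0, -2.0, 0.5])
X_white, mu, Sigma = whiten(X)
```

After the transform, the sample mean of `X_white` is zero and its sample covariance is the identity, up to rounding error.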
Gram-Schmidt orthonormalization transform


Ill-conditioning is usually measured for a data matrix A by its condition number $\rho$, defined as $\rho(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$, where $\sigma_{\max}$ and $\sigma_{\min}$ are respectively the maximum and minimum singular values of A. In the batch LS algorithm, the information matrix $A^T A$ needs to be manipulated. Since $\rho(A^T A) = \rho(A)^2$, the effect of ill-conditioning on parameter estimation will be more severe. Orthogonal decomposition is a well-known technique to eliminate ill-conditioning.

The GSO procedure starts with QR decomposition of the full feature matrix. Denote

$$X = [x_1, x_2, \ldots, x_N], \tag{A.62}$$

where the ith pattern $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,J})^T$, $x_{i,j}$ denotes the jth component of $x_i$, and J is the dimension of the raw data. We then represent $X^T$ by

$$X^T = \left[ x^1, x^2, \ldots, x^J \right], \tag{A.63}$$

where $x^j = (x_{1,j}, x_{2,j}, \ldots, x_{N,j})^T$.
QR decomposition is performed on $X^T$

$$X^T = QR, \tag{A.64}$$

where Q is an orthonormal matrix, that is, $Q^T Q = I_J$, $Q = [q_1, q_2, \ldots, q_J]$, $q_i = (q_{i,1}, q_{i,2}, \ldots, q_{i,N})^T$, $q_{i,j}$ denoting the jth component of $q_i$, and R is an upper triangular matrix. QR decomposition can be performed by the Householder transform or Givens rotation [5], which is suitable for hardware implementation.

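The GSO procedure is given as

$$q_1 = x^1, \tag{A.65}$$

$$q_k = x^k - \sum_{i=1}^{k-1} \alpha_{ik} q_i, \tag{A.66}$$

$$\alpha_{ik} = \begin{cases} \dfrac{(x^k)^T q_i}{q_i^T q_i}, & \text{for } i = 1, 2, \ldots, k-1 \\ 1, & \text{for } i = k \\ 0, & \text{for } i > k \end{cases}. \tag{A.67}$$

Thus $q_k$ is a linear combination of $x^1, \ldots, x^k$, and the Gram-Schmidt features $q_1, \ldots, q_k$ and the vectors $x^1, \ldots, x^k$ are one-to-one mappings, for $1 \le k \le J$.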


The GSO transform can be used for feature subset selection; it inherits the 
compactness of the orthogonal representation and at the same time provides 
features retaining their original meaning. 
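
A minimal sketch of the classical Gram-Schmidt recursion is given below, assuming NumPy and linearly independent feature columns. It assumes the projection coefficients are alpha_ik = (x^k)^T q_i / (q_i^T q_i) for i < k, so the resulting vectors are mutually orthogonal but not normalized; the function name and the random test data are illustrative.

```python
import numpy as np

def gso(XT):
    """Classical Gram-Schmidt on the columns x^1, ..., x^J of XT (N x J).

    Returns Q with mutually orthogonal (unnormalized) columns and the unit
    upper triangular coefficient matrix alpha, so that XT = Q @ alpha.
    """
    N, J = XT.shape
    Q = np.zeros((N, J))
    alpha = np.eye(J)                       # alpha[k, k] = 1, zeros below the diagonal
    for k in range(J):
        q = XT[:, k].copy()
        for i in range(k):
            # Projection coefficient of x^k on the earlier direction q_i.
            alpha[i, k] = (XT[:, k] @ Q[:, i]) / (Q[:, i] @ Q[:, i])
            q -= alpha[i, k] * Q[:, i]
        Q[:, k] = q
    return Q, alpha

rng = np.random.default_rng(1)
XT = rng.standard_normal((6, 3))            # N = 6 patterns, J = 3 features
Q, alpha = gso(XT)
gram = Q.T @ Q                              # should be diagonal
```

Because each q_k depends only on x^1, ..., x^k, dropping trailing features leaves the earlier Gram-Schmidt features unchanged, which is what makes the transform convenient for feature subset selection.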
Stability of dynamic systems


For a dynamic system described by a set of ordinary differential equations, the 
stability of the system can be examined by Lyapunov’s second theorem or the 
Lipschitz condition. 

The Lyapunov theorem is a sufficient, but not a necessary tool for proving the 
stability of an equilibrium of a dynamic system. The method is dependent on 
finding a Lyapunov function for the equilibrium. It is especially important for 
analyzing the stability of recurrent networks and ordinary differential equations. 


Theorem A.1 (Lyapunov). Consider a function L(x). Define a region $\Omega$, where any point $x \in \Omega$ satisfies L(x) < c for a constant c, with the boundary of $\Omega$ given by L(x) = c, such that

• $\frac{dL(x)}{dt} < 0$, $\forall x, x^* \in \Omega$, $x \ne x^*$.

Then, the equilibrium point $x = x^*$ is asymptotically stable, with a domain of attraction $\Omega$.


Theorem A.2 (Lyapunov's second theorem). For a dynamic system described by a set of differential equations

$$\frac{dx}{dt} = f(x), \tag{A.68}$$

where $x = (x_1(t), x_2(t), \ldots, x_n(t))^T$ and $f = (f_1, f_2, \ldots, f_n)^T$, if there exists a positive definite function E = E(x), called a Lyapunov function or energy function, such that

$$\frac{dE}{dt} = \sum_{j=1}^{n} \frac{\partial E}{\partial x_j} \frac{dx_j}{dt} \le 0 \tag{A.69}$$

with $\frac{dE}{dt} = 0$ only for $\frac{dx}{dt} = 0$, then the system is stable, and the trajectories x will asymptotically converge to stationary points as $t \to \infty$.


The stationary points are also known as equilibrium points and attractors. 
The crucial step in applying Lyapunov's second theorem is to find a suitable
energy function. 


Theorem A.3 (Lipschitz condition). For a dynamic system described by (A.68), a sufficient condition that guarantees the existence and uniqueness of the solution is given by the Lipschitz condition

$$\| f(x_1) - f(x_2) \| \le \gamma \| x_1 - x_2 \|, \tag{A.70}$$

where $\gamma$ is a positive constant, called the Lipschitz constant, and $x_1$, $x_2$ are any two variables in the domain of the function vector f. f(x) is then said to be Lipschitz continuous.


If $x_1$ and $x_2$ are in some neighborhood of x, then they are said to satisfy the Lipschitz condition locally and will reach a unique solution in the neighborhood of x. The unique solution is a trajectory that will converge to an attractor asymptotically and reach it only as $t \to \infty$.


Probability theory and stochastic processes 


Conditional probability 

For two statements (or propositions) A and B, one writes A|B to denote the 
situation that A is true subject to the condition that B is true. The probability 
of A|B, called conditional probability, is denoted by P(A|B). This gives a measure 
for the plausibility of the statement A|B. 


Gaussian distribution 
The Gaussian distribution, known as the normal distribution, is the most common assumption for error distribution. The pdf of the normal distribution is defined as

$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}, \tag{A.71}$$

where $\mu$ is the mean and $\sigma > 0$ is the standard deviation. For the Gaussian distribution, 99.73% of the data are within the range of $[\mu - 3\sigma, \mu + 3\sigma]$. The Gaussian distribution has its first-order cumulant as $\mu$, second-order cumulant as $\sigma^2$, and higher-order cumulants as zero. If $\mu = 0$ and $\sigma = 1$, the distribution is called the standard normal distribution. Viewed as a function of the parameters for fixed data, the pdf is also known as the likelihood function. An ML estimator is a set of values $(\mu, \sigma)$ that maximizes the likelihood function for a fixed value of x.

The cumulative distribution function (cdf) is defined as the probability that a random variable is less than or equal to a value x, that is,

$$F(x) = \int_{-\infty}^{x} p(t) \, dt. \tag{A.72}$$

The standard normal cdf, conventionally denoted $\Phi$, is given by setting $\mu = 0$ and $\sigma = 1$. The standard normal cdf is usually expressed by

$$\Phi(x) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{x}{\sqrt{2}} \right) \right], \tag{A.73}$$

where the error function erf(x) is a nonelementary function, which is defined by

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt. \tag{A.74}$$
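
The erf-based expression for the normal cdf is straightforward to evaluate with the Python standard library. The sketch below uses `math.erf`; the function names are illustrative, and the general-parameter version simply standardizes its argument before calling the standard normal cdf.

```python
import math

def std_normal_cdf(x):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of a normal distribution with mean mu and standard deviation sigma > 0."""
    return std_normal_cdf((x - mu) / sigma)

# Probability mass within three standard deviations of the mean.
three_sigma_mass = std_normal_cdf(3.0) - std_normal_cdf(-3.0)
```

The value of `three_sigma_mass` reproduces the 99.73% figure quoted for the interval of three standard deviations around the mean.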
When the vector $x \in \mathbb{R}^n$, the pdf of the normal distribution is then defined by

$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}, \tag{A.75}$$

where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.

The Gaussian distribution is only one member of the canonical exponential family of distributions, and it is suitable for describing real-valued data. In the case of binary-valued, integer-valued, or non-negative data, the Gaussian assumption is inappropriate, and other members of the exponential family can be used. For example, the Poisson distribution is better suited to integer data, the Bernoulli distribution to binary data, and the exponential distribution to non-negative data.


Cauchy distribution 
The Cauchy distribution, also known as the Cauchy-Lorentzian distribution, is another popular data distribution model. The density of the Cauchy distribution is defined as

$$p(x) = \frac{1}{\pi \sigma \left[ 1 + \left( \frac{x - \mu}{\sigma} \right)^2 \right]}, \quad x \in \mathbb{R}, \tag{A.76}$$

where $\mu$ specifies the location of the peak and $\sigma$ specifies the half-width at the half-maximum. When $\mu = 0$ and $\sigma = 1$, the distribution is called the standard Cauchy distribution.
Accordingly, the cdf of the Cauchy distribution is calculated by

$$F(x) = \frac{1}{\pi} \arctan\left( \frac{x - \mu}{\sigma} \right) + \frac{1}{2}. \tag{A.77}$$

None of the moments is defined for the Cauchy distribution. The median of the distribution is equal to $\mu$. Compared to the Gaussian distribution, the Cauchy distribution has a longer tail; this makes it more valuable in stochastic search algorithms by searching larger subspaces in the data space.


Student-t models 
The Student-t pdf is given by

$$p(x) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu \pi} \, \Gamma\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-\frac{\nu + 1}{2}}, \tag{A.78}$$

where $\Gamma(\cdot)$ is the Gamma function, and $\nu$ is the degrees of freedom. The Gaussian distribution is a particular t distribution with $\nu = \infty$.

For a random sample of size n from a normal distribution with mean $\mu$, we get the statistic

$$t = \frac{\bar{x} - \mu}{\hat{\sigma} / \sqrt{n}}, \tag{A.79}$$

where $\bar{x}$ is the sample mean and $\hat{\sigma}$ is the sample standard deviation. t has a Student-t distribution with n − 1 degrees of freedom.

The Student-t distribution has a longer tail than the Gaussian distribution. The pdfs of the Student-t distribution and the normal distribution are plotted in Fig. A.3.

Figure A.3 The Student-t distribution with $\nu = 4$ and standard normal distribution.


Kullback-Leibler divergence 

Mutual information between two signals x and y is characterized by calculating the cross-entropy, known as Kullback-Leibler divergence, between the joint pdf p(x, y) of x and y and the product of the marginal pdfs p(x) and p(y)

$$I(x, y) = \int\!\!\int p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy. \tag{A.80}$$

This may be implemented by estimating the pdfs in terms of the cumulants of the signals. This approach requires the numerical estimation of the joint and marginal densities.


Cumulants 
For random variables $X_1, \ldots, X_4$, second-order cumulants are defined as $\mathrm{cum}(X_1, X_2) = E[\bar{X}_1 \bar{X}_2]$, where $\bar{X}_i = X_i - E[X_i]$, and the fourth-order cumulants are [3]

$$\mathrm{cum}(X_1, X_2, X_3, X_4) = E[\bar{X}_1 \bar{X}_2 \bar{X}_3 \bar{X}_4] - E[\bar{X}_1 \bar{X}_2] E[\bar{X}_3 \bar{X}_4] - E[\bar{X}_1 \bar{X}_3] E[\bar{X}_2 \bar{X}_4] - E[\bar{X}_1 \bar{X}_4] E[\bar{X}_2 \bar{X}_3]. \tag{A.81}$$

The variance and kurtosis of a real random variable X are defined by

$$\mathrm{var}(X) = \sigma^2(X) = \mathrm{cum}(X, X) = E[\bar{X}^2], \tag{A.82}$$

$$\mathrm{kurt}(X) = \mathrm{cum}(X, X, X, X) = E[\bar{X}^4] - 3 E^2[\bar{X}^2]. \tag{A.83}$$

They are the second- and fourth-order autocumulants. A cumulant having at least two different variables is called a cross-cumulant.
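
Sample versions of the two autocumulants can be sketched as follows, assuming NumPy and replacing each expectation by a plain sample average (a 1/N estimate, which is an assumption of this sketch rather than something prescribed by the text). The function names and the four-point data set are illustrative.

```python
import numpy as np

def variance_cumulant(x):
    """Second-order autocumulant: average of the squared centered samples."""
    xc = x - x.mean()
    return np.mean(xc ** 2)

def kurtosis_cumulant(x):
    """Fourth-order autocumulant: E[xc^4] - 3 (E[xc^2])^2 with sample averages."""
    xc = x - x.mean()
    return np.mean(xc ** 4) - 3.0 * np.mean(xc ** 2) ** 2

# A symmetric two-point sample: centered values are +/-1.
x = np.array([-1.0, 1.0, -1.0, 1.0])
var_x = variance_cumulant(x)
kurt_x = kurtosis_cumulant(x)
```

For this sample the second and fourth centered averages both equal 1, so the kurtosis is 1 − 3 = −2, a negative value typical of flat, short-tailed distributions.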
Markov processes, Markov chains and Markov-chain analysis
Markov processes constitute the best-known class of stochastic processes. A Markov process has a limited memory. Assume a stochastic process $\{X(t) : t \in T\}$, where t is time, X(t) is a state in the state space S. A Markov process is defined as a stochastic process that satisfies the relation characterized by the conditional distribution

$$P\left[ X(t_0 + t_1) \le x \mid X(t_0) = x_0, X(\tau) = x_\tau, -\infty < \tau < t_0 \right] = P\left[ X(t_0 + t_1) \le x \mid X(t_0) = x_0 \right] \tag{A.84}$$

for any value of $t_0$ and for $t_1 > 0$. The future distribution of the process is determined by the present value of $X(t_0)$ only. This latter property is known as the Markov property.

When T and S are discrete, a Markov process is called a Markov chain. Conventionally, time is indexed using integers, and a Markov chain is a set of random variables that satisfy

$$P\left[ X_n = x_n \mid X_{n-1} = x_{n-1}, X_{n-2} = x_{n-2}, \ldots \right] = P\left[ X_n = x_n \mid X_{n-1} = x_{n-1} \right]. \tag{A.85}$$

This definition can be extended for multistep Markov chains, where a chain state has conditional dependency on only a finite number of its previous states.

For a Markov chain, $P[X_n = j \mid X_{n-1} = i]$ is the transition probability of state i to j at time n − 1. If

$$P\left[ X_n = j \mid X_{n-1} = i \right] = P\left[ X_{n+m} = j \mid X_{n+m-1} = i \right], \quad m \ge 0, \; i, j \in S, \tag{A.86}$$

the chain is said to be time homogeneous. In this case, one can denote

$$P_{i,j} = P\left[ X_n = j \mid X_{n-1} = i \right] \tag{A.87}$$

and the transition probabilities can be represented by a matrix, called the transition matrix, $P = [P_{i,j}]$, where i, j = 0, 1, .... For finite S, P has a finite dimension. An important property of such Markov chains is their time homogeneity, which means that their transition probabilities $P_{i,j}$ do not depend on time.

In Markov-chain analysis, the transition probability after k step transitions is $P^k$. The stationary distribution or steady-state distribution is a vector that satisfies

$$P^T \pi^* = \pi^*. \tag{A.88}$$

That is, $\pi^*$ is the left eigenvector of P corresponding to the eigenvalue 1.

If P is irreducible and aperiodic, that is, every state is accessible from every other state and in the process none of the states repeats itself periodically, then $P^k$ converges elementwise to a matrix each row of which is the unique stationary distribution $\pi^*$, with

$$\lim_{k \to \infty} \left( P^k \right)^T \pi = \pi^*. \tag{A.89}$$
Figure A.4 State diagram of a Markov chain.


Many modeling applications are Markovian, and Markov-chain analysis is widely 
used for convergence analysis for algorithms. 


Example A.2: The transition probability matrix corresponding to the graph in Fig. A.4 is given by

$$P = \begin{bmatrix} 0.3 & 0.5 & 0.2 & 0 \\ 0 & 0 & 0.2 & 0.8 \\ 0 & 0 & 0.3 & 0.7 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

Probabilities of transitions from state i to all other states add up to one, i.e., $\sum_{j=1}^{N} P_{i,j} = 1$.
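
The example transition matrix can be explored numerically. The sketch below assumes NumPy; the power k = 50 and the variable names are arbitrary illustrative choices. Note that this particular chain has an absorbing last state, so it is not irreducible, yet its powers still settle down because all probability mass eventually flows into that state.

```python
import numpy as np

# Transition matrix of Example A.2 (rows are the current state).
P = np.array([[0.3, 0.5, 0.2, 0.0],
              [0.0, 0.0, 0.2, 0.8],
              [0.0, 0.0, 0.3, 0.7],
              [0.0, 0.0, 0.0, 1.0]])

row_sums = P.sum(axis=1)                  # each row should sum to one

# k-step transition probabilities are given by the matrix power P^k.
P_k = np.linalg.matrix_power(P, 50)

# Stationary distribution: eigenvector of P^T for the eigenvalue closest to 1,
# rescaled so that its entries sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi_star = np.real(eigvecs[:, idx])
pi_star = pi_star / pi_star.sum()
```

Every row of `P_k` approaches `pi_star`, which for this chain places all probability on the absorbing state.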


Numerical optimization techniques 


Although optimization problems can be solved analytically in some cases, numerical optimization techniques are usually more powerful and are also indispensable for all disciplines in science and engineering. Optimization problems discussed
in this book are mainly unconstrained continuous optimization problems, COPs, 
and quadratic programming problems. To deal with constraints, the KKT the- 
orem, as a generalization to the Lagrange multiplier method, introduces a slack 
variable into each inequality constraint before applying the Lagrange multiplier 
method. The conditions derived from the procedure are known as the KKT con- 
ditions [4]. 


A brief taxonomy 

Optimization techniques can generally be divided into derivative methods and 
nonderivative methods, depending on whether or not derivatives of the objective 
function are required for the calculation of the optimum. Derivative methods 
can be either gradient-search methods or second-order methods. Gradient-search 
methods include the gradient descent, CG methods, and the natural-gradient 
method. The gradient descent is also known as steepest descent. It searches for a 
local minimum by taking steps along the negative direction of the gradient of the
function. If the steps are along the positive direction of the gradient, the method 
is known as gradient ascent or steepest ascent. The gradient-descent method is 
credited to Cauchy. Examples of second-order methods are Newton’s method, 
the Gauss-Newton method, quasi-Newton methods, the trust-region method, 
and the LM method. CG methods can also be viewed as a reduced form of the 
quasi-Newton method, with systematic reinitializations of H; to the identity 
matrix. 

Derivative methods can also be classified into model-based and metric-based 
methods. Model-based methods improve the current point by a local approxi- 
mating model. Newton and quasi-Newton methods are model-based methods. 
Metric-based methods perform a transformation of the variables and then apply 
a gradient search method to improve the point. The steepest-descent method, 
quasi-Newton methods, and CG methods belong to this latter category. 

Typical nonderivative methods for multivariable functions are random-restart 
hill-climbing, simulated annealing, evolutionary algorithms, random search, 
many heuristic methods, and their hybrids. Hill-climbing attempts to optimize a 
discrete or continuous function for a local optimum. When operating on continu- 
ous space, it is called gradient ascent. Other nonderivative search methods include 
univariant search parallel to an axis, sequential simplex method, and acceleration 
methods in direct search such as the Hooke-Jeeves method, Powell’s method and 
Rosenbrock’s method. The Hooke-Jeeves method accelerates in distance, Pow- 
ell’s method accelerates in direction, and Rosenbrock’s method accelerates in 
both direction and distance. Interior-point methods represent state-of-the-art 
techniques for solving linear, quadratic and nonlinear optimization programs. 


Lagrange multiplier method 

The Lagrange multiplier method can be used to analytically solve continuous function optimization subject to equality constraints [4]. Let f(x) be the objective function and $h_i(x) = 0$, i = 1, ..., m, be the constraints. The Lagrange function can be constructed as

$$L(x; \lambda_1, \ldots, \lambda_m) = f(x) + \sum_{i=1}^{m} \lambda_i h_i(x), \tag{A.90}$$

where $\lambda_i$, i = 1, ..., m, are called the Lagrange multipliers.

The constrained optimization problem is converted into an unconstrained optimization problem: Optimize $L(x; \lambda_1, \ldots, \lambda_m)$. By setting

$$\frac{\partial}{\partial x} L(x; \lambda_1, \ldots, \lambda_m) = 0, \tag{A.91}$$

$$\frac{\partial}{\partial \lambda_i} L(x; \lambda_1, \ldots, \lambda_m) = 0, \quad i = 1, \ldots, m \tag{A.92}$$

and solving the resulting set of equations, we can obtain the x position at the extremum of f(x) under the constraints.
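
As a small worked illustration (the objective and constraint here are chosen purely for convenience and do not come from the text), consider minimizing $f(x) = x_1^2 + x_2^2$ subject to the single equality constraint $h(x) = x_1 + x_2 - 1 = 0$:

```latex
\begin{align*}
L(x; \lambda) &= x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1), \\
\frac{\partial L}{\partial x_1} &= 2 x_1 + \lambda = 0, \qquad
\frac{\partial L}{\partial x_2} = 2 x_2 + \lambda = 0, \\
\frac{\partial L}{\partial \lambda} &= x_1 + x_2 - 1 = 0 .
\end{align*}
% The first two equations give x_1 = x_2 = -\lambda / 2; substituting into the
% constraint yields \lambda = -1, hence x_1 = x_2 = 1/2 and f(x) = 1/2.
```

The stationarity conditions reduce the constrained problem to a set of three linear equations, whose solution is the point on the line $x_1 + x_2 = 1$ closest to the origin.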
Line search

The popular quasi-Newton and CG methods implement a line search at each iteration. The efficiency of the line-search method significantly affects the performance of these methods.

Bracketing and sectioning are two elementary operations for any line-search method. A bracket is an interval $(\alpha_1, \alpha_2)$ that contains an optimal value of $\alpha$. Any three values of $\alpha$ that satisfy $\alpha_1 < \alpha_2 < \alpha_3$ form a bracket when the values of the function $f(\alpha)$ satisfy $f(\alpha_2) \le \min(f(\alpha_1), f(\alpha_3))$. Sectioning is applied to reduce the size of the bracket at a uniform rate. Once a bracket is identified, it can be contracted by using sectioning or interpolation techniques or their combinations. Popular sectioning techniques are the golden-section search, the Fibonacci search, the secant method, Brent's quadratic approximation, and Powell's quadratic-convergence search without derivatives. The Newton-Raphson search is an analytical line-search technique based on the gradient of the objective function. Wolfe's conditions are two inequality conditions for performing inexact line search. Wolfe's conditions enable an efficient selection of the step size without minimizing $f(\alpha)$.
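
Sectioning can be illustrated with a golden-section search, sketched below in plain Python. The function name, the tolerance, the iteration cap, and the quadratic test function are illustrative assumptions; the sketch also assumes the function is unimodal on the initial bracket.

```python
import math

def golden_section_search(f, a, b, tol=1e-6, max_iter=200):
    """Shrink the bracket [a, b] around a minimizer of a unimodal function f."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0      # about 0.618, the contraction rate
    c = b - inv_phi * (b - a)                   # interior points, with c < d
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if b - a <= tol:
            break
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Illustrative one-dimensional objective with its minimum at 2.
alpha_star = golden_section_search(lambda alpha: (alpha - 2.0) ** 2, 0.0, 5.0)
```

Each iteration reuses one of the two interior function values, so the bracket shrinks by the same factor per step at the cost of a single new function evaluation.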


Semidefinite programming 

For a convex optimization problem, a local solution is the global optimal solu- 
tion. The semidefinite programming (SDP) problem is a convex optimization 
problem with a linear objective, and linear matrix inequality and affine equality 
constraints. It optimizes convex cost functions over the convex cone of posi- 
tive semidefinite matrices. There exist interior-point algorithms to solve SDP 
problems with good theoretical and practical computational efficiency. One very 
useful tool to reduce a problem to an SDP problem is the so-called Schur com- 
plement lemma. The SDP problem can be efficiently solved using standard SDP 
solvers such as a C library for semidefinite programming [2] and the MATLAB 
packages SeDuMi [7] and SDPT3 [8]. 

Many stability or constrained optimization problems, including the SDP problem, can be converted into a quasi-convex optimization problem in the form of a linear matrix inequality (LMI)-based optimization problem. The LMI-based optimization problem can be efficiently solved by interior-point methods by using the MATLAB LMI Control Toolbox. For verifying the stability of delayed neural networks, a Lyapunov function is usually constructed based on the LMI approach.

The constrained concave-convex procedure (CCCP) of [10] is used for solving non-convex optimization problems. CCCP essentially decomposes a non-convex function into a convex component and a concave component. At each iteration, the concave part is replaced by a linear function (namely, the tangential approximation at the current point) and the sum of this linear function and the convex part is minimized to get the next iterate.
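
The iteration can be illustrated on a one-dimensional toy problem. The sketch below is an unconstrained example chosen here for convenience, not the constrained procedure of [10]: the function f(x) = x^4 − 2x^2 is split into the convex part x^4 and the concave part −2x^2, and the convex subproblem at each step happens to have a closed-form minimizer. NumPy is assumed only for the real cube root.

```python
import numpy as np

def f(x):
    """Non-convex objective: convex part x^4 plus concave part -2 x^2."""
    return x ** 4 - 2.0 * x ** 2

def cccp(x0, n_iter=60):
    """Concave-convex iterations for f, started from x0."""
    x = float(x0)
    history = [f(x)]
    for _ in range(n_iter):
        slope = -4.0 * x                     # tangent slope of the concave part at x
        # Minimize x^4 + slope * x: setting 4 x^3 + slope = 0 gives a cube root.
        x = float(np.cbrt(-slope / 4.0))
        history.append(f(x))
    return x, history

x_final, history = cccp(0.1)
```

Starting from a positive point, the iterates approach the local minimizer at x = 1, and the recorded objective values never increase from one iteration to the next.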
A.1 For non-binary data, show that $\|x\|_1 > \|x\|_2 > \|x\|_\infty$.
A.2 Draw the Student-t and Gaussian distributions. 


A.3 Consider the function f(x) = 10x? + 4a? + 3x + 12. 
(a) Compute its gradient. 
(b) Find all its local and global maxima/minima. 
In this Appendix, we provide some benchmarks and resources for pattern recognition and data mining.


Face databases 


The face image data has been standardized as ISO/IEC JTC 1/SC 37 N 506 
(Biometric Data Interchange Formats, Part 5: Face Image Data). 

The AT&T Olivetti Research Laboratory (ORL) face recognition database (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) includes 400 images from 40 individuals. Each individual
has 10 images: 5 for training and 5 for testing. The face images in ORL only 
contain pose variation, and are perfectly centralized/localized. All the images 
are taken against a dark homogeneous background but vary in sampling time, 
illuminations, facial expressions, facial details (glasses/no glasses), scale, and tilt. 
Each image with 256 gray scales is in the size of 92 x 112. 

The California Institute of Technology (CIT) face database (http: 
//www.vision.caltech.edu/Image_Datasets/faces/) has 450 color images, 
the size of each being 320 x 240 pixels, and contains 27 different people and a 
variety of lighting, backgrounds and facial expressions. 

The MIT CBCL face database (http://cbcl.mit.edu/cbcl/ 
software-datasets/FaceData2.html) has 6,977 training images (with 
2,429 faces and 4,548 nonfaces) and 24,045 test images (472 faces and 23,573 
nonfaces). All images are captured in grayscale at a resolution of 19 x 19 pixels, 
but rather than use pixel values as features. 

The Face Recognition Grand Challenge (FRGC) database [6] consists 
of 1,920 images, corresponding to 80 individuals selected from the original col- 
lection. Each individual has 24 controlled or uncontrolled color images. The faces 
are automatically detected and normalized through a face detection method and 
an extraction method. FRGC data set provides high resolution images and 3D 
face data. 

The FRGC v2 database is the largest available database of 3D face images, composed of 4,007 images with different facial expressions from 466 subjects. All images have a resolution of 640 x 480, acquired by a Minolta Vivid 910 laser scanner. The face images have frontal pose and several types of facial expression: neutral, happy, sad, disgusted, surprised, and puffy cheek. Moreover, some images present artifacts: stretched or distorted images, nose absence, holes around nose, or waves around mouth.

The Carnegie Mellon University (CMU) PIE (pose, illumination, and expression) face database (http://www.ri.cmu.edu/projects/project_418.html) contains 68 subjects with 41,368 face images in total.

The Yale face database (cvc.yale.edu/projects/yalefaces/ 
yalefaces.html) contains 165 grayscale images in GIF format of 15 indi- 
viduals. There are 11 images per subject, one per different facial expression or 
configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, 
right-light, sad, sleepy, surprised, and wink. 

The Yale face database B (cvc.yale.edu/projects/yalefacesB/yalefacesB.html) allows for systematic testing of face recognition methods under large variations in illumination and pose.

AR face database (http://www2.ece.ohio-state.edu/~aleix/ 
ARdatabase.html) contains over 4,000 color images corresponding to 126 
people’s faces (70 men and 56 women). Images feature frontal view faces with 
different facial expressions, illumination conditions, and occlusions (sun glasses 
and scarf). 

The Japanese female facial expression (JAFFE) Database (http://www.kasrl.org/jaffe.html) contains 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models. Each image has been rated on 6 emotion adjectives by 60 Japanese subjects.

Oulu physics-based face database (www.ee.oulu.fi/research/imag/color/pbfd.html) contains faces of 125 different individuals, each in 16 different camera calibration and illumination conditions, with an additional 16 if the person has glasses. Faces are in frontal position, captured under horizon, incandescent, fluorescent and daylight illuminants. The database includes 3 spectral reflectances of skin per person, measured from both cheeks and the forehead, and contains the RGB spectral response of the camera used and the spectral power distribution of the illuminants.

The Sheffield (previously UMIST) face database consists of 564 images 
of 20 individuals (mixed race/gender/appearance). Each individual is shown in 
a range of poses from profile to frontal views. The files are all in PGM format, 
approximately 220 x 220 pixels with 256-bit grayscale. 

The University of Notre Dame 3D face database (http://www.nd.edu/~cvrl/CVRL/Data_Sets.html) includes a total of 275 subjects, among which 200 subjects participated in both a gallery acquisition and a probe acquisition. The time lapse between the acquisitions of the probe image and the gallery image for any subject ranges from one to thirteen weeks. The 3D scans in the database
were acquired using a Minolta Vivid 900 range scanner. All subjects were asked 
to display a neutral facial expression and to look directly at the camera. The 
result is a 640 x 480 array of range data. 
The Binghamton University BU-3DFE database is a database of annotated 3D facial expressions [9]. There are a total of 100 subjects in the database, 56 females and 44 males. A neutral scan was captured for each subject, then they were asked to perform six expressions: happiness, anger, fear, disgust, sadness and surprise. The expressions vary according to four levels of intensity (low, middle, high and highest). Thus, there are 25 3D facial expression models per subject.
A set of 83 manually annotated facial landmarks is associated to each model. 
These landmarks are used to define the regions of the face that undergo specific 
deformations due to muscle movements when conveying facial expression. 

The FG-NET aging database (http://www.fgnet.rsunit.com) contains 1,002 high-resolution color or grayscale face images of 82 subjects at different ages, with the minimum age being 0 and the maximum age being 69, with large variation of lighting, pose and expression.

MORPH data corpus (http://www.faceaginggroup.com/projects-morph.html) has two separate databases: Album1 and Album2. Album1 contains 1,690 images from 625 different subjects. Album2 contains more than 20,000 images from more than 4,000 subjects whose metadata (age, sex, ancestry, height, and weight) are also recorded.

Iranian face aging database (http://kiau.ac.ir/bastanfard/IFDB_ 
index.htm) contains digital images of people from 1 to 85 years of age. It is 
a large database that can support studies of the age classification systems. It 
contains over 3,600 color images. 

The Bosphorus database (http://bosphorus.ee.boun.edu.tr/Home. 
aspx) is intended for research on 3D and 2D human face processing tasks includ- 
ing expression recognition, facial action unit detection, facial action unit intensity 
estimation, face recognition under adverse conditions, deformable face model- 
ing, and 3D face reconstruction. There are 105 subjects and 4666 faces in the 
database. 

The XM2VTS face video database (http://www.ee.surrey.ac.uk/ 
CVSSP/xm2vtsdb/) contains four recordings of 295 subjects taken over a period of 
four months. The BioID face detection database is available at http://support. 
bioid.com/downloads/facedb/index.php. Some other face databases are given 
at http://www.face-rec.org/databases/. 

The Yahoo! News face data set was constructed from about half a mil- 
lion captioned news images collected from the Yahoo! News website by crawling 
from the web [1]. It consists of a large number of photographs taken in real-life 
conditions. As a result, there is a large variety of poses, illuminations, expressions, 
and environmental conditions. There are 1940 images, corresponding to the 
97 largest face clusters, in which each individual cluster has 20 images. Faces 
are cropped from the selected images using the face detection and extraction 
methods. 
Some machine learning databases 


The UCI machine learning repository (http://archive.ics.uci.edu/ 
ml/) contains the following popular data sets. 


• The HouseVotes data set contains the 1984 congressional voting records 
for 435 representatives voting on 17 issues. Votes are all three-valued: yes, no, 
or unknown. For each representative, the political party is given; this data set 
is typically used in a classification setting to predict the political party of the 
representative based on the voting record. 

• The mushroom data set contains physical characteristics of 8124 mushrooms, 
as well as whether each mushroom is poisonous or edible. There are 
22 physical characteristics for each mushroom, all of which are discrete. 

• The adult data set has 48,842 patterns of 15 attributes, including 8 categorical 
attributes, 6 numerical attributes, and 1 class attribute. The class 
attribute indicates whether the salary is over $50,000. In the data set, 76% 
of the patterns fall in the class of salaries not over $50,000. The goal is to 
predict whether a household has an income greater than $50,000. 

• The iris data set has 150 data samples from three classes (setosa, versicolor, 
and virginica) with 4 measurements (sepal length, sepal width, petal length, 
petal width). 

• The Wisconsin diagnostic breast cancer data (WDBC) contains 569 
samples, each with 30 features. The samples are grouped into two classes: 
357 samples for benign and 212 for malignant. 

• The Boston housing data set consists of 516 instances with 12 input 
variables (including a binary one) and an output variable representing the 
median housing values in suburbs of Boston. 

• Microsoft web training data (MSWeb) contains 32,711 instances of users 
visiting the www.microsoft.com website on one day in 1996. For each user, 
the data contains a variable indicating whether or not that user visited each 
of the 292 areas of the site. 

• The image segmentation database consists of samples randomly drawn 
from a database of seven outdoor images. The images were hand segmented 
to create a classification for every pixel. Each sample is a 3 × 3 region 
described by 19 attributes. There are a total of 7 classes, each having 330 
samples. The attributes were normalized to lie in [−1, 1]. 
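
The attribute normalization mentioned for the image segmentation data can be sketched as a per-attribute rescaling. This is a minimal sketch, assuming min-max scaling (the data set description above does not say which scheme was used); the function name scale_columns and the toy data are hypothetical. 

```python
def scale_columns(rows, lo=-1.0, hi=1.0):
    """Rescale each column of a list of equal-length rows to [lo, hi].

    Assumption: plain min-max scaling per attribute; the scheme actually
    used for the segmentation data is not stated in the text.
    """
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        new_row = []
        for x, mn, mx in zip(row, mins, maxs):
            if mx == mn:
                # Constant attribute: map to the midpoint (an arbitrary choice).
                new_row.append((lo + hi) / 2.0)
            else:
                new_row.append(lo + (hi - lo) * (x - mn) / (mx - mn))
        scaled.append(new_row)
    return scaled


# Hypothetical toy data: three samples with two attributes each.
toy = [[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]]
scaled_toy = scale_columns(toy)
```

After scaling, the smallest value of each attribute maps to −1 and the largest to 1, so attributes measured on very different scales contribute comparably to distance-based classifiers. 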


The Pascal Large Scale Learning Challenge (http://largescale.first. 
fraunhofer.de) is a competition designed to be fair and to enable a direct 
comparison of current large-scale classifiers. 

The KEEL-dataset repository (http://www.keel.es/dataset.php) pro- 
vides imbalanced data sets, multi-instance data sets, and multi- 
label data sets for evaluating algorithms. Examples of datasets for 
pedestrian detection are the MIT pedestrian dataset (http://cbcl.mit.edu/ 
software-datasets/PedestrianData.html) and the INRIA person dataset (http: 
//pascal.inrialpes.fr/data/human/). For human action recognition, the 
Weizmann action dataset and the ballet dataset (http://www.cs.sfu.ca/ 
research/groups/VML/semilatent/) are video sequences of different actions 
of many subjects. 


EEG data sets 

The EEG data set from BCI Competition 2003 (http://www.bbci.de/ 
competition/) has 28-channel input recorded from a single subject performing 
self-paced key typing, that is, pressing the corresponding keys with the index 
and little fingers in a self-chosen order and timing. 

EEGLAB (http://sccn.ucsd.edu/eeglab/) is an interactive MATLAB 
toolbox for processing continuous and event-related EEG, MEG and other elec- 
trophysiological data incorporating ICA, time/frequency analysis, artifact rejec- 
tion, event-related statistics, and several useful modes of visualization of the 
averaged and single-trial data. The Sleep-EDF database gives sleep recordings 
and hypnograms in European data format (EDF). 


Image databases 

Columbia Object Image Library (COIL-20) (http://www.cs.columbia. 
edu/CAVE/software/softlib/coil-20.php) contains the images of 20 different 
three-dimensional objects. The objects represent cups, toys, drugs and cosmetics. 
For each object 72 training samples are available. 

Some benchmark image databases are the Berkeley image segmentation 
database (http://www.eecs.berkeley.edu/Research/Projects/CS/ 
vision/bsds/) and the BrainWeb database of brain MRI images 
(http://www.bic.mni.mcgill.ca/brainweb/). In the Berkeley segmentation 
database, a natural color image (Training Image #124084) is a flower, which 
contains four dominant colors. A collection of data sets for the annotations of 
the video sequence is available at http://www.vision.ee.ethz.ch/~bleibe/ 
data/datasets.html. 


Biometric databases 

The University of Notre Dame iris image dataset (http://www.nd.edu/ 
~cvrl/CVRL/Data_Sets.html) contains 64,980 iris images obtained from 356 
subjects (712 unique irises) between January 2004 and May 2005. 

The Hong Kong Polytechnic University (PolyU) palmprint database 
(http://www.comp.polyu.edu.hk/~biometrics) includes 600 palmprint 
images of size 128 × 128 from 100 individuals, with 6 images from each. 

Some other biometric databases are the PolyU finger-knuckle-print 
databases (http://www4.comp.polyu.edu.hk/~biometrics/FKP.htm) and 
the CASIA gait database (data set B and C) (http://www.cbsr.ia. 
ac.cn/english/GaitDatabases.asp). 
Data sets for one-class classification 
The intrusion detection data set (http://kdd.ics.uci.edu/databases/ 
kddcup99/kddcup99.html) consists of binary TCP dump data from seven weeks 
of network traffic. Each original pattern has 34 continuous features and 7 symbolic 
features. The training set contains 4,898,431 connection records, which are 
processed from about four gigabytes of compressed binary TCP dump data from 
seven weeks of network traffic. Another two weeks of data produced the test 
data with 311,029 patterns. The data set includes a wide variety of intrusions 
simulated in a military network environment. There are a total of 24 training 
attack types, and an additional 14 types that appear in the test data only. 
The promoter database (from UCI Repository) consists of 106 samples, 53 
of which are promoters and the rest nonpromoters. 


Data sets for handwriting recognition 
The well-known real-world OCR benchmarks are the USPS data set, the MNIST 
data set, and the UCI Letter data set (from UCI Repository). 

The MNIST handwritten digits database (http://yann.lecun.com/ 
exdb/mnist/) consists of 60,000 training samples from approximately 250 writ- 
ers and 10,000 test samples from a disjoint set of 250 other writers. It contains 
784-dimensional nonbinary sparse vectors, each representing a 28 × 28 pixel 
grey-level image of a handwritten digit. 
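
The correspondence between a 784-dimensional vector and a 28 × 28 image can be sketched as follows. This is a minimal sketch, assuming the pixels are stored row by row (row-major order); the function names to_image and sparsity and the toy vector are hypothetical and are not part of any MNIST loader. 

```python
def to_image(vec, height=28, width=28):
    """Reshape a flat pixel vector into a list of rows.

    Assumption: row-major pixel order, i.e. the first `width` entries
    form the top row of the image.
    """
    if len(vec) != height * width:
        raise ValueError("vector length does not match image size")
    return [list(vec[r * width:(r + 1) * width]) for r in range(height)]


def sparsity(vec):
    """Fraction of entries that are exactly zero (background pixels)."""
    return sum(1 for v in vec if v == 0) / len(vec)


# Hypothetical toy vector: a blank image with one bright pixel in the centre.
toy = [0] * 784
toy[28 * 14 + 14] = 255
img = to_image(toy)
```

Because most pixels of a digit image are background, a vector of this kind has many zero entries, which is why the data are described as sparse. 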

The US Postal Service (USPS) handwritten digit database 
(http://www.cs.nyu.edu/~roweis/data.html) contains 7,291 training and 
2,007 test images of handwritten digits, of size 16 × 16. 

The Pendigits data set (from UCI Repository) contains 7,494 training 
digits and 3,498 testing digits represented as vectors in 16-dimensional space. 
The digit database collects 250 samples from each of 44 writers. The samples written by 
30 writers are used for training, and the digits written by the other 14 are used 
for testing. 


Data sets for data mining 


The Reuters-21578 corpus (http://www.daviddlewis.com/resources/ 
testcollections/reuters21578/) is a set of 21,578 economic news articles published by 
Reuters in 1987. Each article is typically assigned to one or more semantic 
categories such as “earn”, “trade” and “corn”, where the total number of 
categories is 114. The commonly used ModApte split filters out duplicate articles 
and those without a labeled topic, and then uses earlier articles as the training 
set and later articles as the test set. 

The 20 Newsgroups data set (http://people.csail.mit.edu/jrennie/ 
20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, 
partitioned (nearly) evenly across 20 different newsgroups. This corpus contains 
26,214 distinct terms after stemming and stop word removal. Each document is 
then represented as a term-frequency vector and normalized to one. 
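
The document representation described above can be sketched as follows. This is a minimal sketch, assuming that "normalized to one" means unit Euclidean norm (the text does not name the norm) and that stemming and stop word removal have already been applied; the names tf_vector and normalize and the toy vocabulary are hypothetical. 

```python
import math
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def normalize(vec):
    """Scale a vector to unit Euclidean norm (assumed choice of norm)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        # A document with no vocabulary terms is left as the zero vector.
        return list(vec)
    return [v / norm for v in vec]


# Hypothetical three-term vocabulary and a toy tokenized document.
vocab = ["kernel", "network", "neuron"]
doc = "neuron network neuron kernel neuron".split()
tf = tf_vector(doc, vocab)
unit = normalize(tf)
```

Normalizing removes the effect of document length, so that a long and a short document with the same term proportions map to the same vector. 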

The CMU WebKB knowledge base (http://www-2.cs.cmu.edu/afs/ 
cs.cmu.edu/project/theo-11/www/wwkb/) is a collection of 8,282 web pages 
obtained from 4 academic domains. The web pages in the WebKB set are labeled 
using two different polychotomies. The first is according to topic and the second is 
according to web domain. The first polychotomy consists of 7 categories: course, 
department, faculty, project, staff, student and other. 

The OHSUMED data set (http://ir.ohsu.edu/ohsumed/ohsumed. 
html) is a clinically oriented MEDLINE subset drawn from 270 medical 
journals published between 1987 and 1991. It consists of 348,566 
references and 106 queries with their respective ranked results. The relevance 
degrees of references with regard to the queries are assessed by humans, on 
three levels: definitely, possibly, or not relevant. In total, there are 16,140 
query-document pairs with relevance judgments. 

The tr41 data set is derived from the TREC-5, TREC-6 and TREC-7 col- 
lections (http://trec.nist.gov). It includes 210 documents belonging to 7 
different classes. The dimension of this data set is 7,454. 

The Spam data set (from the UCI Repository) contains 4,601 examples of 
emails, roughly 39% of which are classified as spam. There are 57 attributes for 
each example, most of which represent how frequently certain words or characters 
appear in the email. 


Databases and tools for speech recognition 


The YOHO speaker verification database consists of sets of 4 combina- 
tion lock phrases spoken by 168 speakers. This database can be purchased from 
Linguistic Data Consortium as LDC94S16. 

The Isolet spoken letter recognition database (from the UCI Reposi- 
tory) contains 150 subjects who spoke the name of each letter of the alphabet 
twice. The speakers are grouped into sets of 30 speakers each and are referred 
to as isolets 1 through 5. 

The TIMIT acoustic-phonetic continuous speech corpus contains a 
total of 6,300 sentences, 10 sentences spoken by each of 630 speakers selected from 8 
major dialect regions of the United States. 70% of the speakers are male, and 30% 
are female. It can be purchased from the Linguistic Data Consortium as LDC93S1. 
The speech was labelled at both a phonetic and lexical level. 

The Oregon Graduate Institute telephone speech (OGI-TS) corpus is 
a multi-lingual speech corpus for LID experiments. The OGI-TS speech corpus 
contains the speech from 11 languages. It includes recorded utterances from 
about 2,052 speakers. 

The CALLFRIEND telephone speech corpus (http://www.ldc.upenn. 
edu/Catalog/) is a collection of unscripted conversations for 12 languages 
recorded over telephone lines. It is used in the NIST language recognition evaluation 
(http://www.itl.nist.gov/iad/mig/tests/lang/) tasks, which are 
performed as language detection: Given a segment of speech and a language 
hypothesis, the task is to decide whether that target language was spoken in the 
given segment. The OGI-TS corpus and the CALLFRIEND corpus are widely 
used in language identification evaluation. 

HMM Tool Kit (http://htk.eng.cam.ac.uk/) is a de facto standard toolkit 
in C for training and manipulating HMMs in speech research. The HMM-based 
speech synthesis system (HTS) (http://hts-engine.sourceforge.net/) adds 
to HMM Tool Kit various functionalities in C for HMM-based speech synthesis. 
Some speech synthesis systems are Festival (http://www.cstr.ed.ac.uk/ 
projects/festival/), Flite (Festival-lite) (http://www.speech.cs.cmu.edu/ 
flite/), and the MARY text-to-speech system (http://mary.dfki.de/). 

The CMU_ARCTIC databases (http://festvox.org/cmu_arctic/) are 
phonetically balanced, U.S. English, single-speaker databases designed for speech 
synthesis research. The HTS recipes for building speaker-dependent and speaker- 
adaptive HTS voices use these databases. 

Some open-source speech processing systems are Speech Signal Processing 
Toolkit (http://sp-tk.sourceforge.net/), STRAIGHT and STRAIGHTtrial 
(http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html), 
and Edinburgh Speech Tools (http://www.cstr.ed.ac.uk/projects/speech_ 
tools/). 


Data sets for microarray and for genome analysis 


The yeast sporulation data set (http://cmgm.stanford.edu/pbrown/ 
sporulation) is a microarray data set on the transcriptional program of sporu- 
lation in budding yeast. A DNA microarray containing 97% of the known and 
predicted genes is used. The total number of genes is 6,118. During the sporu- 
lation process, the mRNA levels were obtained at seven time points 0, 0.5, 2, 5, 
7, 9 and 11.5 h. The ratio of each gene’s mRNA level (expression) to its mRNA 
level in vegetative cells before transfer to the sporulation medium is measured. 

The human fibroblasts serum data set (http://www.sciencemag.org/ 
feature/data/984559.shl) contains the expression levels of 8,613 human 
genes. It has 13 dimensions. A subset of 517 genes whose expression levels 
changed substantially across the time points has been chosen. 

The rat central nervous system data set (http://faculty.washington. 
edu/kayee/cluster) examines the expression levels of a set of 112 genes during 
rat central nervous system development over nine time points. 

The yeast cell cycle data set (http://faculty.washington.edu/kayee/ 
cluster) was extracted from a data set that shows the fluctuation of expression 
levels of approximately 6,000 genes over two cell cycles (17 time points). Out of 
these 6,000 genes, 384 genes have been identified as cell cycle regulated. 
The ELVIRA biomedical data set repository (http://leo.ugr.es/elvira/ 
DBCRepository/index.html) includes high-dimensional biomedical data sets, 
including gene expression data, protein profiling data and genomic sequence data 
that are related to classification. The colon cancer data set consists of 62 samples 
of colon epithelial cells from colon cancer patients. The samples consist of tumor 
biopsies collected from tumors (40 samples), and normal biopsies collected from 
healthy parts of the colons (22 samples) of the same patients. The number of genes 
in the data set is 2000. 

Global cancer map (http://www.broadinstitute.org/cgi-bin/cancer/ 
datasets.cgi) is a gene expression data set consisting of 198 human tumor 
samples spanning 14 different cancer types. 

A normal/tumor gene expression data set (http://www.molbio. 
princeton.edu/colondata) has a training set of 44 gene expression 
profiles, 22 for normal gene profile data and 22 for tumor gene profile data. The 
testing set is composed of 18 tumor gene profiles. Each gene expression 
profile is a 2000-dimensional vector. 


General databases 

GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html) is the NIH 
genomic database, an annotated collection of all publicly available DNA 
sequences. It contains all annotated nucleic acid and amino acid sequences. Apart 
from presenting and annotating sequences, these databases offer many functions 
related to searching and browsing sequences. 

The Rfam database (http://rfam.sanger.ac.uk/) is a collection of RNA 
families, each represented by multiple sequence alignments, consensus secondary 
structures and covariance models. 

The EMBL nucleotide sequence database (EMBL-Bank) 
(http://www.ebi.ac.uk/embl/) constitutes Europe’s primary nucleotide sequence 
resource. 

Stanford microarray database (http://genome-www5.stanford.edu/) 
and gene expression omnibus are the two most famous and abundant gene expres- 
sion databases in the world. Gene expression omnibus is a database including 
links to microarray-based experiments measuring mRNA, genomic DNA and 
protein abundances, as well as non-array techniques such as serial analysis of 
gene expression, and mass spectrometric proteomic data. 


Analysis tools 
Some websites for genome analysis are the Human Genome Project (http://www. 
ornl.gov/sci/techresources/Human_Genome/home.shtml), Ensembl Genome 
Browser (http://www.ensembl.org/index.html), and UCSC Genome Browser 
(http://genome.ucsc.edu/). 

MeV (http://www.tm4.org/mev.html) is a versatile microarray tool, incor- 
porating sophisticated algorithms for clustering, visualization, classification, sta- 
tistical analysis and biological theme discovery. 
For sequence analysis, BLAST (http://blast.ncbi.nlm.nih.gov/Blast. 
cgi) finds regions of similarity between biological sequences, and ClustalW2 
(http://www.ebi.ac.uk/Tools/clustalw/) is a general-purpose multiple 
sequence alignment program for DNA or proteins. 

SignatureClust (http://infos.korea.ac.kr/sigclust.php) is a tool for 
landmark gene-guided clustering that enables biologists to get multiple views 
of the microarray data. 


Software 


Stuttgart Neural Network Simulator (http://www.ra.cs.uni-tuebingen.de/ 
SNNS/) is a software simulator for neural networks on Unix systems. The simulator 
kernel is written in C and it provides an X graphical user interface. The 
simulator supports the following network architectures and learning procedures that are 
discussed in this book: online BP, BP with momentum term and flat spot elimi- 
nation, batch BP, Quickprop, RProp, generalized RBF network, ART 1, ART 2, 
ARTMAP, cascade correlation, dynamic LVQ, BPTT, Quickprop through time, 
SOM, TDNN with BP, Jordan networks, Elman networks, and associative mem- 
ory. 

SHOGUN (http://www.shogun-toolbox.org) is an open-source toolbox in 
C++ that runs on UNIX/Linux platforms and interfaces to MATLAB. It pro- 
vides a generic interface to 15 SVM implementations (among them are SVMlight, 
LibSVM, GPDT, SVMLin, LibLinear, SVM SGD, SVMPegasos and OCAS, ker- 
nel ridge regression, SVR), multiple kernel learning, Naive Bayes classifier, k-NN, 
LDA, HMMs, C-means, hierarchical clustering. SVMs can be combined with 
more than 35 different kernel functions. One of SHOGUN’s key features is the 
combined kernel to construct weighted linear combinations of multiple kernels 
that may even be defined on different input domains. 
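
The idea of a combined kernel can be sketched as follows. This is a minimal sketch in plain Python and does not use or reproduce SHOGUN's own interface; the function names, the two base kernels and the toy samples are hypothetical choices made for illustration. Each sample is assumed to be a tuple holding one representation per input domain. 

```python
import math


def linear_kernel(x, y):
    """Inner product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel with an assumed default width of 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def combined_kernel(sample_a, sample_b, kernels, weights):
    """Weighted linear combination of kernels, one kernel per input domain.

    sample_a and sample_b are tuples with one entry per domain; kernels[i]
    is applied to the i-th entries and scaled by weights[i].
    """
    if not (len(sample_a) == len(sample_b) == len(kernels) == len(weights)):
        raise ValueError("one kernel and one weight are needed per domain")
    if any(w < 0 for w in weights):
        # Nonnegative weights keep the combination a valid kernel.
        raise ValueError("weights must be nonnegative")
    return sum(w * k(xa, xb)
               for w, k, xa, xb in zip(weights, kernels, sample_a, sample_b))


# Hypothetical samples: a 2-D feature vector and a 3-D feature vector each.
a = ([1.0, 2.0], [0.0, 0.0, 1.0])
b = ([3.0, 4.0], [0.0, 0.0, 1.0])
value = combined_kernel(a, b, [linear_kernel, gaussian_kernel], [0.5, 2.0])
```

Here the linear kernel contributes 0.5 × 11 and the Gaussian kernel contributes 2.0 × 1, so the combined value is 7.5. In multiple kernel learning the weights are learned from data rather than fixed by hand as in this sketch. 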

Dlib-ml (http://dclib.sourceforge.net) provides a similarly rich environment 
for developing machine learning software in C++. It contains an extensible 
linear algebra toolkit with built-in BLAS support. It also houses implementations 
of algorithms for performing inference in Bayesian networks and kernel-based 
methods for classification, regression, clustering, anomaly detection, and feature 
ranking. MLPACK (http://www.mlpack.org) is a scalable, multi-platform 
C++ machine learning library offering a simple, consistent API, high performance 
and flexibility. 

Netlab (http://www1.aston.ac.uk/eas/research/groups/ncrg/ 
resources/netlab/) is another neural network simulator implemented in 
MATLAB. 

DOGMA (http://dogma.sourceforge.net) is a MATLAB toolbox for discriminative 
online learning. The library focuses on linear and kernel online algorithms, 
mainly developed in the relative mistake bound framework. Examples 
are perceptron, passive-aggressive, ALMA, NORMA, SILK, projectron, RBP 
and Banditron. 

Some resources for implementing ELMs and RBF networks are: ELM 
(http://www.ntu.edu.sg/home/egbhuang/), optimally pruned ELM 
(http://www.cis.hut.fi/projects/tsp/index.php?page=OPELM), and the improved 
Levenberg-Marquardt algorithm for RBF networks 
(http://www.eng.auburn.edu/~wilambm/nnt/index.htm). 

A MATLAB toolbox for implementing several PCA techniques 
is available at http://research.ics.tkk.fi/bayes/software/index. 
shtml. Some NMF tools are NMFPack (MATLAB, http://www.cs.helsinki. 
fi/u/phoyer/software.html), NMF package (C++, http://nmf.r-forge. 
r-project.org) and bioNMF (MATLAB, C, http://bionmf.cnb.csic.es). 

Some resources for implementing ICA are: JADE 
(http://www.tsi.enst.fr/icacentral/Algos/cardoso/), FastICA 
(http://www.cis.hut.fi/projects/ica/fastica/), efficient FastICA 
(http://itakura.kes.tul.cz/zbynek/downloads.htm), RADICAL 
(http://www.eecs.berkeley.edu/~egmil/ICA), and denoising source separation 
(http://www.cis.hut.fi/projects/dss/). 

Some resources for implementing clustering are: SOM_PAK and 
LVQ_PAK (http://www.cis.hut.fi/~hynde/lvq/), Java applets for TSP 
based on SOM and Kohonen network (http://sydney.edu.au/engineering/ 
it/~irena/ai0l/nn/tsp.html, http://www.sund.de/netze/applets/som/ 
som2/), Java applet implementing several competitive learning based clustering 
algorithms (http://www.sund.de/netze/applets/gng/full/GNG-U_0.html), 
C++ code for minimum sum-squared residue coclustering algorithm 
(http://www.cs.utexas.edu/users/dml/Software/cocluster.html), 
and C++ code for single-pass fuzzy C-means and online fuzzy C-means 
(http://www.csee.usf.edu/~hall/scalable). 

Some resources for implementing LDA are: uncorrelated LDA 
and orthogonal LDA (http://www-users.cs.umn.edu/~jieping/UOLDA/), 
neighborhood component analysis (http://www.cs.berkeley.edu/~fowlkes/ 
software/nca/), local LDA (http://sugiyama-www.cs.titech.ac.jp/~sugi/ 
software/LFDA/), and semi-supervised local Fisher discriminant analysis 
(http://sugiyama-www.cs.titech.ac.jp/~sugi/software/SELF). 

Some resources for implementing SVMs are: Lagrangian SVM 
(http://www.cs.wisc.edu/dmi/lsvm), potential SVM 
(http://ni.cs.tu-berlin.de/software/psvm), LASVM 
(http://leon.bottou.com/projects/lasvm, 
http://www.neuroinformatik.rub.de/PEOPLE/igel/solasvm), LS-SVM 
(http://www.esat.kuleuven.ac.be/sista/lssvmlab/), 2ν-SVM 
(dsp.rice.edu/software), Laplacian SVM in the primal 
(http://sourceforge.net/projects/lapsvmp/), SimpleSVM 
(http://sourceforge.net/projects/simplesvm/), decision-tree SVM 
(http://ocrwks11.iis.sinica.edu.tw/dar/Download/WebPages/DTSVM.htm), core 
vector machine (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html), OCAS 
and OCAM in LIBOCAS (http://cmp.felk.cvut.cz/~xfrancv/ocas/html/, 
http://mloss.org/software/view/85/) and as a part of the SHOGUN 
toolbox, Pegasos (http://ttic.uchicago.edu/~shai/code), MSVMpack 
in C (http://www.loria.fr/~lauer/MSVMpack/), and BMRM in C++ 
(http://users.cecs.anu.edu.au/~chteo/BMRM.html). 

Some resources for implementing kernel methods are: 
regularized kernel discriminant analysis 
(http://www.public.asu.edu/~jye02/Software/DKL/), 
Lp-norm multiple kernel learning 
(http://doc.ml.tu-berlin.de/nonsparse_mkl/, implemented 
within the SHOGUN toolbox), TRON and TRON-LR in LIBLINEAR 
(http://www.csie.ntu.edu.tw/~cjlin/liblinear), FaLKM-lib 
(http://disi.unitn.it/~segata/FaLKM-lib), SimpleMKL 
(http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html), 
HessianMKL (http://olivier.chapelle.cc/ams/), Level MKL 
(http://appsrv.cse.cuhk.edu.hk/~zlxu/toolbox/level_mkl.html), SpicyMKL 
(http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/SpicyMKL), and the 
generalized kernel machine toolbox (http://theoval.cmp.uea.ac.uk/~gcc/ 
projects/gkm). A selected collection of tutorials, publications, computer codes 
for Gaussian processes, mathematical programming, SVM and kernel methods 
can be found at http://www.kernel-machines.org. 


For Bayesian networks 

XMLBIF (XML-based BayesNets Interchange Format) is an XML-based format 
that is very simple to understand and yet can represent DAGs with probabilistic 
relations, decision variables and utility values. The XMLBIF format is implemented 
in the JavaBayes (http://www.cs.cmu.edu/~javabayes/) and GeNie 
(http://genie.sis.pitt.edu/) systems. Netica (http://www.norsys.com) is 
a popular software package for Bayesian network development, widely used by the 
world’s leading companies and government agencies. 

FastInf (http://compbio.cs.huji.ac.il/FastInf) is a C++ library for 
propagation-based approximate inference methods in large-scale discrete undi- 
rected graphical models. Various message-scheduling schemes that improve on 
the standard synchronous or asynchronous approaches are included. FastInf 
includes exact inference by the junction-tree algorithm [2], loopy belief prop- 
agation, generalized belief propagation [8], tree re-weighted belief propagation 
[7], propagation based on convexification of the Bethe free energy [3], variational 
Bayesian, and Gibbs sampling. All methods can be applied to both sum and max 
product propagation schemes, with or without damping of messages. 

libDAI (http://www.libdai.org) is an open-source C++ library that pro- 
vides implementations of various exact and approximate inference methods for 
graphical models with discrete-valued variables. libDAI uses factor graphs. Apart 
from exact inference by brute force enumeration and the junction-tree method, 
libDAI offers the following approximate inference methods for calculating partition 
sums, marginals and MAP states: mean field, (loopy) belief propagation, tree 
expectation propagation [4], generalized belief propagation [8], loop-corrected 
belief propagation [5], a Gibbs sampler, and several other methods. In addition, 
libDAI supports parameter learning of conditional probability tables by ML or 
EM (in case of missing data). 

Some resources for implementing Bayesian and probabilistic networks 
are: Murphy’s Bayes Network Toolbox (in MATLAB, http://code. 
google.com/p/bnt/), Probabilistic Networks Library (http://sourceforge. 
net/projects/openpnl), GRMM (http://mallet.cs.umass.edu/grmm), Factorie 
(http://code.google.com/p/factorie), Hugin (http://www.hugin. 
com), and an applet showcasing common Markov chain algorithms 
(http://www.lbreyer.com/classic.html). 


For reinforcement learning 
RL-Glue (http://glue.rl-community.org) is language-independent software 
for reinforcement-learning experiments; it provides a common interface 
for a number of software and hardware projects in the reinforcement-learning 
community. RL-Glue has been ported to a number of languages, including 
C/C++, Java and MATLAB, via sockets. 

Libpgrl (http://code.google.com/p/libpgrl/) implements both model-free 
reinforcement learning and policy search algorithms, though not any model-based 
learning. Libpgrl is efficient in a distributed reinforcement learning environment. 
Libpgrl is a fast C++ implementation that has abstract classes to model a subset 
of reinforcement learning. 

The MATLAB Markov Decision Process Toolbox (http://www.inra.fr/mia/ 
T/MDPtoolbox/) implements only a few basic algorithms such as tabular Q- 
learning, SARSA and dynamic programming. Some resources on reinforcement 
learning are available at http://www-all.cs.umass.edu/rlr/. 


Other resources 
CVX (http://cvxr.com/cvx/) is a MATLAB-based modeling system for convex 
optimization. 

SDPT3 (http://www.math.nus.edu.sg/~mattohkc/sdpt3.html) is an SDP 
solver. The MATLAB function fmincon is an SQP solver with a quasi-Newton 
approximation to the Hessian of the Lagrangian using the BFGS method. 

SparseLab (http://sparselab.stanford.edu/) is a MATLAB software package 
for sparse solutions to systems of linear equations. 

Resources on random forests are available at http://www.math.usu.edu/ 
~adele/forests/. The Weka machine learning archive (http://www.cs. 
waikato.ac.nz/ml/weka/) offers a Java implementation of random forests. The 
classification results for bagging and boosting can be obtained using Weka on 
identical training and test sets. 

The MultiBoost package (http://www.multiboost.org/) provides a fast 
C++ implementation of multi-class/multi-label/multitask boosting algorithms. 

C5.0 (http://www.rulequest.com/see5-info.html) is a sophisticated data 
mining tool in C for discovering patterns that delineate categories, assembling 
them into classifiers, and using them to make predictions. 

Resources on GPU can be found at http://www.nvidia.com and 
http://www.gpgpu.org/. 


B.1 Download Netica (http://www.norsys.com). Practise using the software 
for Bayesian network development. 
C-median clustering, 275 
K-RIP, 53 
p-norm, 817 

t-conorm, 684 

t-norm, 684 

modus ponens, 689 

modus tollens, 689 


activation function, 5 

active learning, 19 
adaline model, 71 

adaptive neural network, 10 
agglomerative clustering, 282, 283 
aggregation, 685 


Akaike information criterion (AIC), 29 


all-points representation, 283 
also operator, 692 

analogical learning, 18 

ANFIS model, 719 

antecedent, 693 

anti-Hebbian learning rule, 387 
anti-Hebbian rule, 388 

APEX algorithm, 388 
approximate reasoning, 689 
Armijo’s condition, 109 


ART 1, 248 

ART 2, 248 

ART model, 247 

ART network, 246 

ARTMAP model, 248 
association rule, 632 
asymmetric PCA, 406 
asymptotical upper bound, 201 
asynchronous or serial update mode, 166 
attractor, 825 

autoassociation, 194 
autoassociative MLP, 393 
autoregressive (AR) model, 360 
average fuzzy density, 295 
average storage error rate, 201 


backpropagation learning, 88 

backward substitution, 820 

bag-of-words model, 782 

bagging, 657 

basic probability assignment function, 666 
basis pursuit, 56 

batch learning, 93 

batch OLS, 323 

Bayes optimal, 469 

Bayes’ theorem, 590 

Bayesian decision surface, 238 

Bayesian information criterion (BIC), 29 
Bayesian network, 592 

Bayesian network inference, 603 
Bayesianism, 590 

belief function, 667 

belief propagation, 604, 606 

BFGS method, 140 

bias-variance dilemma, 25, 31 
bidirectional associative memory (BAM), 195 
bifurcation, 183 

bifurcation parameter, 183 

binary neural network, 47 

binary RBF, 48 

binary RBF network, 48 

bipolar coding, 196 

BIRCH, 286 
Boolean function, 47 

Boolean VC dimension, 42 
boosting, 653 

bottleneck layer, 209, 393 

BP through time (BPTT), 359 
BP with global descent, 117 

BP with momentum, 91 

BP with tunneling, 117 
bracketing, 832 
brain-states-in-a-box (BSB), 195 
Bregman distance, 402 

Brent’s quadratic approximation, 832 
Broyden family, 140 

Broyden’s approach, 136 


canonical correlation analysis, 410 
Cartesian product, 685 
cascade-correlation, 333 

Cauchy annealing, 173 

Cauchy distribution, 827 
Cauchy machine, 173, 633 
Cauchy-Riemann equations, 153 
cellular neural network, 186 
CHAMELEON, 286 

chaotic, 181 

chaotic neural network, 181 
chaotic simulated annealing, 183 
character recognition, 145 
city block metric, 40 

classical Newton’s method, 134 
cloning templates, 187 

cloud computing, 753 

cluster analysis, 226 
cluster separation, 294 
clustering feature tree, 287 
clustering tree, 282 

CMAC, 341 

coclustering, 300 
combinatorial optimization problem, 175 
competitive Hebbian learning, 245, 246 
competitive learning, 226 

competitive learning network, 226 
complement, 678 

complete linkage, 283 

completeness, 709 

complex fuzzy logic, 698 

complex fuzzy set, 698 

complex RBF network, 337 
complex-valued ICA, 455 

complex-valued membership function, 698 
complex-valued MLP, 153 
complex-valued multistate Hopfield network, 
complex-valued PCA, 402 
compositional rule of inference, 689 

compressed sensing, 53 

computational learning theory, 40 
compute unified device architecture (CUDA), 749 

concave fuzzy set, 679 
concurrent neurofuzzy model, 718 

condition number, 821 

conditional FCM, 280 

conditional independence, 591 

conditional independence test, 597 

conditional probability table (CPT), 592 

conic-section function network, 340 

conjugate-gradient (CG) method, 142 

conjunction, 683 

connectionist model, 1 

conscience strategy, 271 

consequent parameter, 720 

consistency, 709 

constrained ICA, 450 

constraint-satisfaction problem, 629 

constrained PCA, 398 

content-addressable memory, 194 

content-based image retrieval, 803 

content-based music retrieval, 805 

continuity, 709 

contrast function, 437 

convex fuzzy set, 678 

cooperative neurofuzzy model, 718 

core, 678 

correspondence analysis, 789 

coupled PCA, 386 

Cramer-Rao bound, 437 

crisp silhouette, 296 

cross-coupled Hebbian rule, 407 

crosstalk, 197 

crossvalidation, 27 

cumulant, 828 

cumulative distribution function (cdf), 826 

CURE, 286 
curse of dimensionality, 21, 715 
curve-fitting, 21 


d-separation, 592 

Dale’s law, 80 

data visualization, 228 

data whitening, 823 
Davidon-Fletcher-Powell (DFP) method, 140 
DBSCAN, 286 

De Morgan’s laws, 685 

dead-unit problem, 270 

deep Bayesian network, 633 

deflation transformation, 409 
defuzzification, 692, 694, 695 
Delaunay triangulation, 245 
delearning rate, 272 

delta rule, 88 

delta-bar-delta, 110 

demixing matrix, 436 
Dempster-Shafer theory of evidence, 666 
dendrogram, 284 

density-based clustering, 282 
deterministic annealing, 173, 275 
deterministic finite-state automaton, 714 
deterministic global-descent, 117 
dichotomy, 47 

differential entropy, 437 

directed acyclic graph (DAG), 592 
discrete Fourier transform (DFT), 180 
discrete Hartley transform, 180 
disjunction, 683 

distributed SVM, 525 

divisive clustering, 282 
dual-orthogonal RBF network, 360 
Dyna, 581 

dynamic Bayesian network, 617 


early stopping, 23 

EASI, 441 

echo state network, 364 

EEG, 458 

eigenstructure learning rule, 206 

eigenvalue decomposition, 818 

EKF-based RAN, 335 

elastic ring, 235 

Elman network, 362 

empirical risk minimization (ERM) principle, 43 

empty set, 678 

energy function, 825 

ensemble learning, 613, 651 

epoch, 16 

equality, 680 

equilibrium point, 825 

Euler-discretized Hopfield network, 184 

excitation center, 230 


expectation-maximization (EM) algorithm, 618 

expected risk, 43 

exploration-exploitation problem, 577 

exponential correlation associative memory, 209 

extended Hopfield model, 178 

extended Kalman filtering (EKF), 146 

extension principle, 685 


factor analysis, 375 

factor graph, 606 

factorizable RBF, 317 

familiarity memory, 195 

fast-learning mode, 248 

FastICA, 442 

feature extraction, 37 

feature selection, 36 

Fibonacci search, 832 

final prediction error, 29 

FIR neural network, 355 

first-order TSK model, 697, 711 

Fisher’s determinant ratio, 470 

Fisherfaces, 473 

flat-spot problem, 39, 106 

fMRI, 458 

frame of discernment, 666 

frequency-sensitive competitive learning (FSCL), 271 

frequentist, 590 

full mode, 248 

fully complex BP, 154 

function counting theorem, 47, 201 

functional data analysis, 57 

fundamental memory, 196 

fuzzification, 692, 694 

fuzzy C-means (FCM), 256 

fuzzy C-median clustering, 275 

fuzzy annealing, 173 

fuzzy ARTMAP, 248 

fuzzy ASIC, 748 

fuzzy BP, 728 

fuzzy clustering, 256 

fuzzy complex number, 698 

fuzzy controller, 692 

fuzzy coprocessor, 748 

fuzzy covariance matrix, 295 

fuzzy density, 282 

fuzzy graph, 686 

fuzzy hypervolume, 282, 295 

fuzzy implication, 687 

fuzzy implication rule, 693 

fuzzy inference engine, 692 

fuzzy inference system, 692 

fuzzy interference, 693 

fuzzy mapping rule, 693 
fuzzy matrix, 686 

fuzzy min-max neural network, 729 
fuzzy neural network, 718 
fuzzy number, 679 

fuzzy partition, 677 

fuzzy perceptron, 79 
fuzzy reasoning, 688, 695 
fuzzy relation, 685 

fuzzy rule, 693 

fuzzy set, 677 

fuzzy shell thickness, 296 
fuzzy silhouette, 296 
fuzzy singleton, 680 
fuzzy subset, 680 

fuzzy transform, 681 


gain annealing, 179 

Gardner algorithm, 204 

Gardner conditions, 204 

Gauss-Newton method, 135 

Gaussian distribution, 826 

Gaussian machine, 632 

Gaussian RBF network, 316 

general position, 47 

generalization, 21 

generalization error, 22 

generalized modus ponens, 689 

generalized binary RBF, 48 

generalized delta rule, 88 

generalized eigenvalue, 404 

generalized EVD, 404 

generalized Hebbian algorithm, 380 

generalized Hebbian rule, 196, 207 

generalized Hopfield network, 201 

generalized linear discriminant, 342 

generalized LVQ, 260 

generalized RBF network, 320 

generalized secant method, 136 

generalized sigmoidal function, 98 

generalized single-layer network, 342 

generalized SVD, 477 

generic fuzzy perceptron, 729 

Gibbs sampling, 611 

Givens rotation, 820, 821 

Givens transform, 821 

global descent, 117 

globally convergent, 111 

golden-section search, 832 

Gram-Schmidt orthonormalization, 824 

Gram-Schmidt orthonormalization (GSO), 37 

granular computing, 701 

graph-theoretical technique, 285 

graphics processing unit (GPU), 749 

graphical model, 591 

grid partitioning, 715 
Gaussian RBF network, 327 
guillotine cut, 716 
Gustafson-Kessel clustering, 259 


Hamming associative memory, 210 
Hamming decoder, 210 

Hamming distance, 197 

Hamming network, 210 
handwritten character recognition, 155 
hard limiter, 70 
hardware/software codesign, 740 
Hebbian rule, 371 

Hebbian rule with decay, 197 
Hecht-Nielsen’s theorem, 50 
hedge, 680 

height, 677 

heteroassociation, 194 

hidden Markov model (HMM), 614 
hierarchical clustering, 283 
hierarchical fuzzy system, 716 
hierarchical RBF network, 333 
higher-order statistics, 437 
Ho-Kashyap rule, 79 

Hopfield model, 165 

Householder reflection, 821 
Householder transform, 820, 821 
Huber’s function, 33 

Huffman coding, 271 

hyperbolic tangent, 70 
hyperellipsoid, 327 
hyperellipsoidal cluster, 259 
hypersurface reconstruction, 21 
hypervolume, 295 

hypothesis space, 45 


ICA network, 447 
ill-conditioning, 824 
ill-posedness, 28 

image compression, 390 

image segmentation, 224 
imbalanced data, 59 

importance sampling, 610 
inclusion, 680 

incremental C-means, 250 
incremental LDA, 478 
incremental learning, 93 
independent factor analysis, 624 
independent subspace analysis, 445 
independent vector analysis, 452 
induced Delaunay triangulation, 245 
inductive learning, 18 

influence function, 33 

infomax, 439 

interactive-or (i-or), 710 
intercluster distance, 283 
interpretability, 709 
intersection, 683 

interval neural network, 731 
inverse fuzzy transform, 681 
inverse Hebbian rule, 204 

inverse reinforcement learning, 578 
ISODATA, 292 


JADE, 441 
jitter, 25 


Karhunen-Loeve transform, 373 
kernel, 678 

kernel autoassociator, 560 
kernel CCA, 560 

kernel ICA, 560 

kernel LDA, 556 

kernel PCA, 552 

Kirchhoff’s current law, 168 
KKT conditions, 494, 830 
Kohonen layer, 229 

Kohonen learning rule, 230 
Kohonen network, 229 
Kolmogorov’s theorem, 50 
Kramer’s nonlinear PCA network, 394 
Kullback-Leibler divergence, 828 
kurtosis, 437 


Lagrange multiplier method, 831 
LASSO, 55 

latent semantic indexing, 786 

latent variable model, 376 

lateral orthogonalization network, 408 
layerwise linear learning, 151 

LBG, 250 

LDA network, 405, 481 

leaky learning, 271 

learning, 21 

learning automata, 586 

learning vector quantization (LVQ), 237 
least squares, 321 

leave-one-out, 27 

left principal singular vector, 408 

left singular vector, 819 
Levenberg-Marquardt (LM) method, 136 
likelihood function, 826 

line search, 133, 832 

linear associative memory, 195 

linear discriminant analysis (LDA), 469 
linear LS, 817 

linear scaling, 823 

linear threshold gate (LTG), 42 
linearly inseparable, 49 

linearly separable, 49 

linguistic variable, 676 

Liouville’s theorem, 153 

Lipschitz condition, 117, 825 


liquid state machine, 364 

LM with adaptive momentum, 138 
LMS algorithm, 74 

localized ICA, 454 

localized PCA, 400 
location-allocation problem, 177 
logistic function, 33, 70 

logistic map, 184 

long-term memory, 196 

look-up table, 741 

loopy belief propagation, 605 

loss function, 33 

lotto-type competitive learning, 273 
LS-SVM, 501 

LTG network, 47 

Lyapunov function, 169, 825 
Lyapunov theorem, 825 
Lyapunov’s second theorem, 825 


madaline model, 75 

Mahalanobis distance, 280 
Mamdani model, 694 
MapReduce, 753 

Markov blanket, 593 

Markov chain, 829 

Markov chain Monte Carlo (MCMC), 609 
Markov network, 591 

Markov process, 829 

Markov random field, 591 
Markov-chain analysis, 829 
matrix completion, 55 

matrix inversion lemma, 822 
max-min composition, 686, 695 
max-min model, 695 
max-product composition, 695 
maximum absolute error, 38 
maximum-entropy clustering, 275 
MAXNET, 210 

Mays’ rule, 79 

McCulloch-Pitts neuron, 5 

MDL principle, 30 

mean absolute error, 38 
mean-field annealing, 629 
mean-field approximation, 612, 629 
mean-field-theory machine, 629 

median of the absolute deviation (MAD), 35 


median RBF, 331 

median squared error, 38 

MEG, 458 

membership function, 682 

Metropolis algorithm, 171 
Metropolis-Hastings method, 609 
Micchelli’s interpolation theorem, 314 
min-max composition, 686 

minimal disturbance principle, 339 
minimal RAN, 335 
minimum description length (MDL), 29 

minimum spanning tree (MST), 285 

Minkowski-r metric, 40 

minor component analysis (MCA), 395 

minor subspace analysis, 397 

mixture model, 620 

mixture of experts, 650 

ML estimator, 33, 826 

MLP-based autoassociative memory, 208, 209 

model selection, 27 

modus tollens, 688 

momentum factor, 91 

momentum term, 91 

Moore-Penrose generalized inverse, 816 

mountain clustering, 253 


multi-valued recurrent correlation associative memory, 209 
multilevel grid structures, 716 
multilevel Hopfield network, 185 
multilevel sigmoidal function, 185 
multiple correspondence analysis, 789 
multiple kernel learning, 563 
multiplicative ICA model, 453 
multistate Hopfield network, 185, 207 
multistate neuron, 185 
multivalued complex-signum function, 185 
mutual information, 437 


naive mean-field approximation, 631 
NARX model, 355 

natural gradient, 150, 441 
natural-gradient descent method, 150 
nearest-neighbor paradigm, 225, 283 
negation, 683, 685 

negentropy, 437 

neighborhood function, 230 

network pruning, 24 

neural gas, 243 

neurofuzzy model, 728 

Newton’s direction, 138 

Newton’s method, 134 
Newton-Raphson search, 832 
no-free-lunch theorem, 46 

noise clustering, 276 
non-Gaussianity, 437 

non-normal fuzzy set, 677 

nonlinear discriminant analysis, 481 
nonlinear ICA, 449 

nonlinearly separable, 49 
nonnegative ICA, 451 

normal fuzzy set, 677 

normalized Hebbian rule, 371 
normalized RBF network, 332 
normalized RTRL, 358 
NP-complete, 38, 175 


Occam’s razor, 58 

Ohm’s law, 168 

Oja’s rule, 372 

one-neuron perceptron, 70 
one-step secant method, 141 
ontology, 702 

optimal brain damage (OBD), 101 
optimal brain surgeon (OBS), 101 
optimal cell damage, 101 
orthogonal least squares (OLS), 323 
orthogonal matching pursuit, 56 
orthogonal Oja, 397 

orthogonal summing rule, 668 
orthogonalization rule, 388 

outer product rule of storage, 196 
outlier, 32 

outlier mining, 782 

overcomplete ICA, 436 
overfitting, 21 

overlearning problem, 439 
overpenalization, 273 
overtraining, 23 


PAC learnable, 45 

parallel SVM, 525 

partial least squares, 822 

particle filtering, 610 

partitional clustering, 282 

Parzen classifier, 316 

PASTd, 383 

pattern completion, 627 
perceptron, 70 

perceptron convergence theorem, 72 
perceptron learning algorithm, 72 
perceptron-type learning rule, 198 
permutation ambiguity, 452 
perturbation analysis, 101 
plausibility function, 667 

pocket algorithm, 74 

pocket algorithm with ratchet, 74 
pocket convergence theorem, 74 
point-symmetry distance, 280 
Polak-Ribiere CG, 143 
polynomial kernel, 490, 552 
polynomial threshold gate, 49 
positive-definite, 327 

possibilistic C-means (PCM), 277 
possibility distribution, 676 
postprocessing, 35 

Powell’s quadratic-convergence search, 832 
power of a fuzzy set, 681 
premature saturation, 106 
premise, 693 

premise parameter, 719 
preprocessing, 35 

prewhitening, 448 
principal component, 373 
principal curves, 393 
principal singular component, 408 
principal singular value, 407 
principal subspace analysis, 377 
principle of duality, 685 
probabilistic ICA, 624 
probabilistic neural network, 316 
probabilistic PCA, 621 
probabilistic relational model, 594 
probably approximately correct (PAC), 45 
product of two fuzzy sets, 680 
progressive learning, 339 
projected clustering, 298 
projected conjugate gradient, 496 
projection learning rule, 198 
projective NMF, 428 
pseudo-Gaussian function, 317 
pseudoinverse, 816 
pseudoinverse rule, 197 
pulse width modulation, 739 


QR decomposition, 820 
quadratic programming, 495 
quantum associative memory, 206 
quasi-Newton condition, 140 
quasi-Newton method, 138 
Quickprop, 111 


radial basis function, 316 
rank-two secant method, 140 
Rayleigh coefficient, 470 
Rayleigh quotient, 819 
RBF-AR model, 360 
RBF-ARX model, 360 
real-time recurrent learning (RTRL), 358 
recollection, 195 
recurrent BP, 358 
recurrent correlation associative memory, 208 
recurrent MLP, 358 
recursive least squares (RLS), 149 
recursive OLS, 324 
regularization, 24 
regularization network, 312 
regularization technique, 102 
regularized forward OLS, 324 
regularized LDA, 474 
reinforcement learning, 19 
relevance vector machine, 534 
representation layer, 393 
reservoir computing, 363 
resilient propagation (RProp), 119 
resistance-capacitance model, 355 
resource-allocating network (RAN), 334 
restricted Boltzmann machine, 633 
retrieval stage, 199 
Riemannian metric, 150 
right principal singular vector, 408 
right singular vector, 819 
RISC processor, 748 
rival penalized competitive learning (RPCL), 272 
RLS method, 321 
Robbins-Monro conditions, 225, 370 
robust BP, 118 
robust clustering, 275 
robust learning, 32 
robust PCA, 393 
robust RLS algorithm, 385 
robust statistics, 275 
rough set, 701 
Rubner-Tavan PCA, 387 
rule extraction, 713 
rule generation, 713 
rule refinement, 713 


sample complexity, 45 
scale estimator, 33 
scaled CG, 143 
scatter partitioning, 715 
scatter-points representation, 283 
search-then-converge schedule, 108 
secant method, 139, 832 
secant relation, 140 
second-order learning, 133 
sectioning, 832 
self-creating mechanism, 292 
self-organizing map (SOM), 227 
semantic web, 794 
semi-supervised learning, 19 
semidefinite programming, 832 
sensitivity analysis, 99 
sequential minimal optimization, 497 
sequential simplex method, 831 
shadowed sets, 702 
shell clustering, 296 
shell thickness, 296 
Sherman-Morrison-Woodbury formula, 822 
short-term memory, 196 
short-term memory steady-state mode, 248 
sigmoidal function, 70 
sigmoidal membership function, 710 
sigmoidal RBF, 317 
sign-constrained perceptron, 80 
signal counter, 292 
simple competitive learning, 225 
simulated annealing, 171 
simulated reannealing, 173 
single linkage, 283 
single-layer perceptron, 71 
singular value, 819 
singular value decomposition, 819 
singular value decomposition (SVD), 405 
slack neuron, 178 

slow feature analysis, 458 

small sample size problem, 472 
softcompetition, 260 
softcompetition scheme, 259 
softcompetitive learning, 274 
sparse approximation, 54 

sparse ICA, 454 

sparse PCA, 399 

sparsity, 26 

spherical-shell thickness, 282 

split complex BP, 153 
split-complex EKF, 155 
split-complex RProp, 155 

spurious state, 204, 632 

square ICA, 436 

stability, 26 

stability-plasticity dilemma, 246 
stability-speed problem, 383 
standard normal cdf, 826 

standard normal distribution, 826 
STAR C-means, 273 

stationary distribution, 829 
stationary subspace analysis, 457 
statistical thermodynamics, 171 
stochastic approximation theory, 370 
stochastic relaxation principle, 259 
Stone-Weierstrass theorem, 51 
storage capability, 200 

structural risk-minimization (SRM), 489 
structured data analysis, 789 
Student-t models, 827 
sub-Gaussian, 438 

sublinear, 34 

subspace learning algorithm, 376 
subthreshold region, 744 
subtractive clustering, 253 
successive approximative BP, 116 
sum-product algorithm, 604 
super-Gaussian, 438 

SuperSAB, 110 

supervised clustering, 279 
supervised learning, 17 

supervised PCA, 401 

support, 677 

support vector, 489, 491 

support vector clustering, 522 
support vector ordinal regression, 521 
support vector regression, 517 
symmetry-based C-means, 280 
synchronous or parallel update mode, 167 
systolic array, 739, 751 


Takagi-Sugeno-Kang (TSK) model, 697 
Talwar’s function, 33 

tanh estimator, 33 

tao-robust BP, 118 

tapped-delayed-line memory, 355 

Taylor-series expansion, 147 

template matrix, 186 

temporal association network, 351 

temporal-difference learning, 581 

terminal attractor, 117 

terminal attractor-based BP, 117 

terminal repeller unconstrained subenergy tunneling (TRUST), 117 

thin-plate spline, 317 

Thomas Bayes, 590 

three-term BP, 92 

time-delay neural network, 354 

time-dependent recurrent learning (TDRL), 358 

topology-preserving, 228 

topology-preserving network, 243 

topology-representing network, 246 

total least squares (TLS), 370 

trained machine, 43 

transfer learning, 17 

transition matrix, 829 

traveling salesman problem (TSP), 175 

tree partitioning, 715 

triangular conorm (t-conorm), 683 

triangular norm (t-norm), 683 

truncated BPTT, 359 

trust-region search, 133 

Tucker decomposition, 406 

Turing equivalent, 47 

Turing machine, 354 

two-dimensional PCA, 403 

type-n fuzzy set, 681 


uncertainty function, 667 
uncorrelated LDA, 475 
undercomplete ICA, 436 
underpenalization, 273 
underutilization problem, 270 
union, 683 

universal approximation, 51, 87, 353 
universal approximator, 628, 692 
universal Turing machine, 47 
universe of discourse, 676 
unsupervised learning, 18 

upper bound, 200 


validation set, 23, 27 
Vapnik-Chervonenkis (VC) dimension, 41 
variable-metric method, 139 

variational Bayesian, 612 

VC confidence, 43 

vector norm, 817 
vector quantization, 224 
vector space model, 782 
vigilance test, 249 
Voronoi diagram, 225 
Voronoi set, 225 
Voronoi tessellation, 225 


wavelet neural network, 342 
weak-inversion region, 744 
weight initialization, 112 
weight scaling, 105, 152 
weight sharing, 25 

weight smoothing, 103 
weight-decay technique, 24, 102, 149 
weighted Hebbian rule, 197 
weighted SLA, 377 
weighted-mean method, 694 
Widrow-Hoff delta rule, 75 
winner-takes-all (WTA), 227 
winner-takes-most, 274 
Wolfe’s conditions, 832 


zero-order TSK model, 697 